
Code Optimization

Jan 26, 2016


Ahmed Akrout

Page 1: Code Optimization

1

Code Optimization

Page 2: Code Optimization

2

Outline

• Optimization Blockers
  – Memory aliasing
  – Side effects in function calls

• Understanding Modern Processors
  – Superscalar
  – Out-of-order execution

• More Code Optimization Techniques

• Performance Tuning

• Suggested reading
  – 5.1, 5.7 ~ 5.16

Page 3: Code Optimization

3

5.1 Capabilities and Limitations of Optimizing Compilers

Review of:
5.3 Program Example
5.4 Eliminating Loop Inefficiencies
5.5 Reducing Procedure Calls
5.6 Eliminating Unneeded Memory References

Page 4: Code Optimization

4

void combine1(vec_ptr v, data_t *dest)
{
    int i;
    *dest = IDENT;

    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

Example P387

Page 5: Code Optimization

5

void combine2(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    *dest = IDENT;

    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

Example P388

Page 6: Code Optimization

6

void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);

    *dest = IDENT;
    for (i = 0; i < length; i++) {
        *dest = *dest OPER data[i];
    }
}

Example P392

Page 7: Code Optimization

7

void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x = IDENT;

    for (i = 0; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}

Example P394

Page 8: Code Optimization

8

Machine Independent Opt. Results

• Optimizations
  – Reduce function calls and memory references within the loop

Page 9: Code Optimization

9

Machine Independent Opt. Results

• Performance Anomaly
  – Computing the FP product of all elements is exceptionally slow
  – Very large speedup when accumulating in a temporary
  – Memory uses a 64-bit format, registers use 80 bits
  – Benchmark data caused overflow of 64 bits, but not 80

Method (CPE)     Function   Integer +   Integer *   FP +    FP *
Abstract -g      Combine1   42.06       41.86       41.44   160.00
Abstract -O2     Combine1   31.25       33.25       31.25   143.00
Move vec_length  Combine2   22.61       21.25       21.15   135.00
Data access      Combine3    6.00        9.00        8.00   117.00
Accum. in temp   Combine4    2.00        4.00        3.00     5.00

(Combine1 P385, Combine2 P388, Combine3 P392, Combine4 P394)

Page 10: Code Optimization

10

Optimization Blockers P394

void combine4(vec_ptr v, int *dest)

{

int i;

int length = vec_length(v);

int *data = get_vec_start(v);

int sum = 0;

for (i = 0; i < length; i++)

sum += data[i];

*dest = sum;

}

Page 11: Code Optimization

11

Optimization Blocker: Memory Aliasing P394

• Aliasing
  – Two different memory references specify a single location

• Example

– v: [3, 2, 17]

– combine3(v, get_vec_start(v)+2) --> ?

– combine4(v, get_vec_start(v)+2) --> ?
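A hedged worked answer, assuming OPER is + and IDENT is 0 (the slide leaves the operation open): here dest aliases v[2]. combine3 writes the running sum back into v[2] on every iteration, so the element changes while it is being summed: v becomes [3, 2, 3], then [3, 2, 5], then [3, 2, 10], giving 10. combine4 accumulates in a register and stores once, giving 3 + 2 + 17 = 22. The two versions therefore produce different results whenever dest aliases the vector.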

Page 12: Code Optimization

12

Optimization Blocker: Memory Aliasing

• Observations

– Easy to have happen in C

• Since allowed to do address arithmetic

• Direct access to storage structures

– Get in the habit of introducing local variables

  • Accumulating within loops

  • Your way of telling the compiler not to check for aliasing (see the sketch below)
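A minimal sketch of this habit (function and parameter names are illustrative, not from the text); it mirrors what combine4 does, accumulating in a local so the compiler never has to assume the destination aliases the data being read:

/* Sketch: accumulate in a local, touch memory once at the end. */
void sum_array(const int *data, int n, int *dest)
{
    int i;
    int sum = 0;            /* local accumulator, cannot alias data[] */
    for (i = 0; i < n; i++)
        sum += data[i];     /* no per-iteration load/store of *dest */
    *dest = sum;            /* single store after the loop */
}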

Page 13: Code Optimization

13

Optimizing Compilers

• Provide efficient mapping of program to

machine

– register allocation

– code selection and ordering

– eliminating minor inefficiencies

Page 14: Code Optimization

14

Optimizing Compilers

• Don't (usually) improve asymptotic efficiency
  – Up to the programmer to select the best overall algorithm
  – Big-O savings are (often) more important than constant factors
    • But constant factors also matter

• Have difficulty overcoming "optimization blockers"
  – Potential memory aliasing
  – Potential procedure side effects

Page 15: Code Optimization

15

Limitations of Optimizing Compilers

• Operate under a fundamental constraint
  – Must not cause any change in program behavior under any possible condition
  – This often prevents optimizations that would only affect behavior under pathological conditions

Page 16: Code Optimization

16

Limitations of Optimizing Compilers

• Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  – e.g., data ranges may be more limited than variable types suggest
    • e.g., using an "int" in C for what could be an enumerated type

Page 17: Code Optimization

17

Limitations of Optimizing Compilers

• Most analysis is performed only within procedures

– whole-program analysis is too expensive in most cases

• Most analysis is based only on static information

– compiler has difficulty anticipating run-time inputs

• When in doubt, the compiler must be conservative

Page 18: Code Optimization

18

Optimization Blockers P380

• Memory aliasing

void twiddle1(int *xp, int *yp)
{
    *xp += *yp;
    *xp += *yp;
}

void twiddle2(int *xp, int *yp)
{
    *xp += 2 * *yp;
}
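Why the compiler cannot rewrite twiddle1 as twiddle2: if xp and yp point to the same location, twiddle1 quadruples the value (x becomes 2x, then 4x) while twiddle2 only triples it (3x). The two are not equivalent, so the compiler must assume the pointers may alias and keep both memory operations.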

Page 19: Code Optimization

19

Optimization Blockers P381

• Function calls and side effects

int f(int);

int func1(int x)
{
    return f(x) + f(x) + f(x) + f(x);
}

int func2(int x)
{
    return 4 * f(x);
}

Page 20: Code Optimization

20

Optimization Blockers P381

• Function calls and side effects

int counter = 0;

int f(int x)
{
    return counter++;
}
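With this f, func1 returns 0 + 1 + 2 + 3 = 6 and leaves counter at 4, while func2 returns 4 * 0 = 0 and leaves counter at 1. Because a call may have side effects like this, the compiler cannot collapse the four calls into one.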

Page 21: Code Optimization

21

5.7 Understanding Modern Processors
5.7.1 Overall Operation

Page 22: Code Optimization

22

Modern CPU Design Figure 5.11 P396

[Block diagram: an Instruction Control unit (Fetch Control, Instruction Decode, Instruction Cache, Retirement Unit, Register File) sends decoded operations to an Execution unit containing the functional units (Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store) and a Data Cache; operation results, branch-prediction feedback, and register updates flow back to the instruction control side.]

Page 23: Code Optimization

23

[The same block diagram as Figure 5.11, annotated with numbers that correspond to the units discussed on the following slides.]

Page 24: Code Optimization

24

Modern Processor P396

• Superscalar

– Perform multiple operations on every clock cycle

• Out-of-order execution

– The order in which the instructions execute need

not correspond to their ordering in the assembly

program

Page 25: Code Optimization

25

Modern Processor P396

• Two main parts

  – Instruction Control Unit

    • Responsible for reading a sequence of instructions from memory

    • Generates from these instructions a set of primitive operations to perform on program data

– Execution Unit

Page 26: Code Optimization

26

1) Instruction Control Unit

• Instruction Cache
  – A special, high-speed memory containing the most recently accessed instructions

Page 27: Code Optimization

27

1) Instruction Control Unit

• Instruction Decoding Logic
  – Takes actual program instructions
  – Converts them into a set of primitive operations
  – Each primitive operation performs some simple task
    • Simple arithmetic, Load, Store
    • addl %eax, 4(%edx) becomes three operations:

        load 4(%edx)      → t1
        addl %eax, t1     → t2
        store t2, 4(%edx)

– Register renaming

P397

P398

Page 28: Code Optimization

28

2) Fetch Control

• Fetch Ahead P396

  – Fetches well ahead of currently accessed instructions

  – ICU has enough time to decode these

  – ICU has enough time to send decoded operations down to the EU

Page 29: Code Optimization

29

Fetch Control

• Branch Prediction P397
  – Branch taken or fall through
  – Guess whether the branch is taken or not

• Speculative Execution P397
  – Fetch, decode and execute according to the branch prediction
  – Before the branch outcome has been determined

Page 30: Code Optimization

30

5.7 Understanding Modern Processors
5.7.2 Functional Unit Performance

Page 31: Code Optimization

31

Multi-functional Units

• Multiple Instructions Can Execute in Parallel

– 1 load

– 1 store

– 2 integer (one may be branch)

– 1 FP Addition

– 1 FP Multiplication or Division

Page 32: Code Optimization

32

Multi-functional Units Figure 5.12 P400

• Some instructions take > 1 cycle, but can be pipelined

  Instruction                 Latency   Cycles/Issue
  Load / Store                3         1
  Integer Multiply            4         1
  Integer Divide              36        36
  Double/Single FP Multiply   5         2
  Double/Single FP Add        3         1
  Double/Single FP Divide     38        38

Page 33: Code Optimization

33

5.7 Understanding Modern Processors
5.7.1 Overall Operation

Page 34: Code Optimization

34

Execution Unit

• Receives operations from ICU

• Each cycle it may receive more than one operation

• Operations are queued in buffer

Page 35: Code Optimization

35

Execution Unit

• An operation is dispatched to one of the functional units whenever
  – All of its operands are ready
  – A suitable functional unit is available

• Execution results are passed among the functional units

• (7) Data Cache P398
  – A high-speed memory containing the most recently accessed data values

Page 36: Code Optimization

36

4) Retirement Unit P398

• Instructions need to commit in serial order

– Misprediction

– Exception

• Updates architectural state

– Memory and register values

Page 37: Code Optimization

37

5.7.3 A Closer Look at Processor Operation
Translating Instructions into Operations

Page 38: Code Optimization

38

Translation Example P401

.L24:                             # Loop:
  imull (%eax,%edx,4),%ecx        # t *= data[i]
  incl %edx                       # i++
  cmpl %esi,%edx                  # i:length
  jl .L24                         # if < goto Loop

translates into the operations:

  load (%eax,%edx.0,4)   → t.1
  imull t.1, %ecx.0      → %ecx.1
  incl %edx.0            → %edx.1
  cmpl %esi, %edx.1      → cc.1
  jl-taken cc.1

Page 39: Code Optimization

39

Understanding Translation Example P401

• Split into two operations
  – Load reads from memory to generate temporary result t.1
  – Multiply operation just operates on registers

  imull (%eax,%edx,4),%ecx   →   load (%eax,%edx.0,4)  → t.1
                                 imull t.1, %ecx.0     → %ecx.1

Page 40: Code Optimization

40

Understanding Translation Example P401

• Operands

– Register %eax does not change in the loop; its value is retrieved from the register file during decoding

  imull (%eax,%edx,4),%ecx   →   load (%eax,%edx.0,4)  → t.1
                                 imull t.1, %ecx.0     → %ecx.1

Page 41: Code Optimization

41

Understanding Translation Example P401

• Operands

– Register %ecx changes on every iteration.

– Uniquely identify the different versions as
  • %ecx.0, %ecx.1, %ecx.2, …

– Register renaming
  • Values passed directly from producer to consumers

  imull (%eax,%edx,4),%ecx   →   load (%eax,%edx.0,4)  → t.1
                                 imull t.1, %ecx.0     → %ecx.1

Page 42: Code Optimization

42

Understanding Translation Example P402

• Register %edx changes on each iteration

• Renamed as %edx.0, %edx.1, %edx.2, …

incl %edx   →   incl %edx.0 → %edx.1

Page 43: Code Optimization

43

Understanding Translation Example P402

• Condition codes are treated similarly to registers

• A tag defines the connection between producer and consumer

  cmpl %esi,%edx   →   cmpl %esi, %edx.1 → cc.1

Page 44: Code Optimization

44

Understanding Translation Example P402

• Instruction control unit determines the destination of the jump

• Predicts whether the branch will be taken

• Starts fetching instructions at the predicted destination

  jl .L24   →   jl-taken cc.1

Page 45: Code Optimization

45

Understanding Translation Example P401

• Execution unit simply checks whether or not prediction was OK

• If not, it signals instruction control
  – Instruction control then "invalidates" any operations generated from misfetched instructions
  – Begins fetching and decoding instructions at the correct target

  jl .L24   →   jl-taken cc.1

Page 46: Code Optimization

46

• Operations
  – Vertical position denotes the time at which it executes
    • Cannot begin an operation until its operands are available
  – Height denotes latency

• Operands
  – Arcs shown only for operands that are passed within the execution unit

  load (%eax,%edx.0,4)   → t.1
  imull t.1, %ecx.0      → %ecx.1
  incl %edx.0            → %edx.1
  cmpl %esi, %edx.1      → cc.1
  jl-taken cc.1

Visualizing Operations Figure 5.13 P403
[Timing diagram of one iteration's operations for the product loop.]

Page 47: Code Optimization

47

• Operations
  – Same as before, except that the add has a latency of 1

  load (%eax,%edx,4)    → t.1
  iaddl t.1, %ecx.0     → %ecx.1
  incl %edx.0           → %edx.1
  cmpl %esi, %edx.1     → cc.1
  jl-taken cc.1

Visualizing Operations Figure 5.14 P403
[Timing diagram of one iteration's operations when combining with addition.]

Page 48: Code Optimization

48

• Unlimited Resource Analysis
  – Assume an operation can start as soon as its operands are available
  – Operations for multiple iterations overlap in time

• Performance
  – Limiting factor becomes the latency of the integer multiplier
  – Gives CPE of 4.0

3 Iterations of Combining Product Figure 5.15 P404
[Timing diagram, cycles 1–15, iterations i = 0, 1, 2: the imull operations form a serial chain, one multiply starting every 4 cycles.]

Page 49: Code Optimization

49

• Unlimited Resource Analysis

• Performance
  – Can begin a new iteration on each clock cycle
  – Should give CPE of 1.0
  – Would require executing 4 integer operations in parallel

4 Iterations of Combining Sum Figure 5.16 P405
[Timing diagram, cycles 1–7, iterations i = 0..3 of the sum loop under the unlimited-resource assumption.]

Page 50: Code Optimization

50

Combining Product: Resource Constraints Figure 5.17 P406

Page 51: Code Optimization

51

Combining Sum: Resource Constraints Figure 5.18 P408
[Timing diagram, cycles 6–18, iterations 4–8: with only two integer functional units, some operations are delayed even though their operands are ready.]

Page 52: Code Optimization

52

Combining Sum: Resource Constraints

• Only two integer functional units are available
• Some operations are delayed even though their operands are available
• Priority is set based on program order
• Performance
  – Sustains CPE of 2.0
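The 2.0 follows from the resource limit: each iteration needs four integer operations (addl, incl, cmpl, jl), and with only two integer units at most two of them can issue per cycle, so an iteration completes at best every two cycles.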

Page 53: Code Optimization

53

5.8 Reducing Loop Overhead

Page 54: Code Optimization

54

Loop unrolling P409

void combine5(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x = IDENT;

    /* combine 3 elements at a time */
    for (i = 0; i < length-2; i += 3)
        x = x OPER data[i] OPER data[i+1] OPER data[i+2];

    /* finish any remaining elements */
    for (; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}

Page 55: Code Optimization

55

– Loads can be pipelined, since they have no dependencies

– Only one set of loop-control operations per three elements

  load (%eax,%edx.0,4)    → t.1a
  iaddl t.1a, %ecx.0c     → %ecx.1a
  load 4(%eax,%edx.0,4)   → t.1b
  iaddl t.1b, %ecx.1a     → %ecx.1b
  load 8(%eax,%edx.0,4)   → t.1c
  iaddl t.1c, %ecx.1b     → %ecx.1c
  iaddl $3,%edx.0         → %edx.1
  cmpl %esi, %edx.1       → cc.1
  jl-taken cc.1

Visualizing Unrolled Loop P410

Page 56: Code Optimization

56

Visualizing Unrolled Loop Figure 5.20 P410
[Timing diagram of one unrolled iteration: the three loads overlap, while the three addl operations form a chain through %ecx.]

Measured CPE = 1.33

Page 57: Code Optimization

57

Executing with Loop Unrolling Figure 5.21 P411
[Timing diagram, cycles 5–15, iterations 3 and 4 (i = 6 and i = 9) of the unrolled loop.]

Page 58: Code Optimization

58

Executing with Loop Unrolling

• Predicted Performance
  – Can complete an iteration (three elements) in 3 cycles
  – Should give CPE of 1.0

• Measured Performance
  – CPE of 1.33
  – One iteration every 4 cycles

Page 59: Code Optimization

59

Effect of Unrolling P411

  Unrolling Degree   1      2      3      4      8      16
  Integer Sum        2.00   1.50   1.33   1.50   1.25   1.06
  Integer Product    4.00 (unchanged at all degrees)
  FP Sum             3.00 (unchanged at all degrees)
  FP Product         5.00 (unchanged at all degrees)

Page 60: Code Optimization

60

Effect of Unrolling

• Only helps the integer sum for our examples

  – Other cases are constrained by functional unit latencies

• Effect is nonlinear with the degree of unrolling

  – Many subtle effects determine the exact scheduling of operations

Page 61: Code Optimization

61

5.9 Converting to Pointer Code

Page 62: Code Optimization

62

void combine4p(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int *dend = data + length;
    int x = IDENT;

    for (; data < dend; data++)
        x = x OPER *data;
    *dest = x;
}

Example P413

Page 63: Code Optimization

63

• Some compilers and processors do better job optimizing array code

  Function     Integer +   Integer *   FP +    FP *
  Combine4     2.00        4.00        3.00    5.00
  Combine4p    3.00        4.00        3.00    5.00

Pointer Code vs. Array Code P414

Page 64: Code Optimization

64

Array code:
  .L24:                        # Loop:
    addl (%eax,%edx,4),%ecx    # x += data[i]
    incl %edx                  # i++
    cmpl %esi,%edx             # i:length
    jl .L24                    # if < goto Loop

Pointer code:
  .L30:                        # Loop:
    addl (%eax),%ecx           # x += *data
    addl $4,%eax               # data++
    cmpl %edx,%eax             # data:dend
    jb .L30                    # if < goto Loop

Pointer vs. Array Code Inner Loops P414

Page 65: Code Optimization

65

• Performance
  – Array code: 4 instructions in 2 clock cycles
  – Pointer code: almost the same 4 instructions, but in 3 clock cycles

Pointer vs. Array Code Inner Loops

Page 66: Code Optimization

66

5.10 Enhancing Parallelism

Page 67: Code Optimization

67

void combine6(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x0 = IDENT, x1 = IDENT;

    /* combine 2 elements at a time (stop one short so data[i+1] stays in bounds) */
    for (i = 0; i < length-1; i += 2) {
        x0 = x0 OPER data[i];
        x1 = x1 OPER data[i+1];
    }

    /* finish any remaining elements */
    for (; i < length; i++)
        x0 = x0 OPER data[i];
    *dest = x0 OPER x1;
}

Loop Splitting P416

Page 68: Code Optimization

68

Loop Splitting

• Optimization– Accumulate in two different sums

• Can be performed simultaneously

– Combine at end

– Exploits property that integer addition & multiplication are associative & commutative

– FP addition & multiplication not associative, but transformation usually acceptable

Associative:可结合的Commutative:可交换的
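A concrete illustration of why FP reassociation can change the answer: in double precision, (1e20 + -1e20) + 3.14 evaluates to 3.14, but 1e20 + (-1e20 + 3.14) evaluates to 0, because 3.14 is lost when added to -1e20. For most data sets the difference is negligible, which is why the transformation is usually acceptable.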

Page 69: Code Optimization

69

  load (%eax,%edx.0,4)    → t.1a
  imull t.1a, %ecx.0      → %ecx.1
  load 4(%eax,%edx.0,4)   → t.1b
  imull t.1b, %ebx.0      → %ebx.1
  iaddl $2,%edx.0         → %edx.1
  cmpl %esi, %edx.1       → cc.1
  jl-taken cc.1

Visualizing Parallel Loop P417

• Two multiplies within loop no longer have data dependency

• Allows them to pipeline

Page 70: Code Optimization

70

Visualizing Parallel Loop Figure 5.25 P417
[Timing diagram of one iteration: the two imull operations, feeding %ecx and %ebx respectively, can execute in parallel.]

Page 71: Code Optimization

71

Executing with Parallel Loop Figure 5.26 P418
[Timing diagram, cycles 1–16, iterations 1–3 (i = 0, 2, 4) of the two-way parallel loop.]

Page 72: Code Optimization

72

  Method            Integer +   Integer *   FP +    FP *
  Unroll 4          1.50        4.00        3.00    5.00
  Unroll 16         1.06        4.00        3.00    5.00
  2 x 2             1.50        2.00        2.00    2.50
  4 x 4             1.50        2.00        1.50    2.50
  8 x 4             1.25        1.25        1.50    2.00
  Theoretical Opt.  1.00        1.00        1.00    2.00

Optimization Results for Combining P419

Page 73: Code Optimization

73

Optimization Results for Combining

• Register spilling

  – Only 6 integer registers are available

  – The extra accumulators must use memory as storage

  – Example:

      movl -12(%ebp), %edi
      imull 24(%eax), %edi
      movl %edi, -12(%ebp)

Page 74: Code Optimization

74

5.11 Putting it Together: Summary of Results for Optimizing Combining Code

5.11.1 Floating-Point Performance Anomaly
5.11.2 Changing Platforms

Page 75: Code Optimization

75

5.12 Branch Prediction and Misprediction Penalties

Page 76: Code Optimization

76

What About Branches?

• Challenge
  – Instruction Control Unit must work well ahead of the Execution Unit
  – To generate enough operations to keep the EU busy

Page 77: Code Optimization

77

  80489f3: movl $0x1,%ecx
  80489f8: xorl %edx,%edx
  80489fa: cmpl %esi,%edx
  80489fc: jnl 8048a25
  80489fe: movl %esi,%esi
  8048a00: imull (%eax,%edx,4),%ecx

[In the slide's figure, the earlier instructions are marked "Executing" while the later ones are still "Fetching & Decoding".]

What About Branches?

Page 78: Code Optimization

78

What About Branches?

• Challenge

– When it encounters a conditional branch, it cannot reliably determine where to continue fetching

Page 79: Code Optimization

79

Branch Outcomes

• When it encounters a conditional branch, the processor cannot determine where to continue fetching

  – Branch Taken: transfer control to the branch target

  – Branch Not-Taken: continue with the next instruction in sequence

• Cannot resolve until the outcome is determined by the branch/integer unit

Page 80: Code Optimization

80

  80489f3: movl $0x1,%ecx
  80489f8: xorl %edx,%edx
  80489fa: cmpl %esi,%edx
  80489fc: jnl 8048a25               ← conditional branch
  80489fe: movl %esi,%esi            ← Branch Not-Taken: continue here
  8048a00: imull (%eax,%edx,4),%ecx

  8048a25: cmpl %edi,%edx            ← Branch Taken: continue here
  8048a27: jl 8048a20
  8048a29: movl 0xc(%ebp),%eax
  8048a2c: leal 0xffffffe8(%ebp),%esp
  8048a2f: movl %ecx,(%eax)

Branch Outcomes

Page 81: Code Optimization

81

Branch Prediction

• Idea

– Guess which way branch will go

– Begin executing instructions at predicted

position

• But don’t actually modify register or memory data

Page 82: Code Optimization

82

  80489f3: movl $0x1,%ecx
  80489f8: xorl %edx,%edx
  80489fa: cmpl %esi,%edx
  80489fc: jnl 8048a25               ← predict taken
  . . .

  8048a25: cmpl %edi,%edx            ← begin executing here speculatively
  8048a27: jl 8048a20
  8048a29: movl 0xc(%ebp),%eax
  8048a2c: leal 0xffffffe8(%ebp),%esp
  8048a2f: movl %ecx,(%eax)

Branch Prediction

Page 83: Code Optimization

83

Assume vector length = 100. Each iteration executes:

  80488b1: movl (%ecx,%edx,4),%eax
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
  80488b7: cmpl %esi,%edx
  80488b9: jl 80488b1

  i = 98:  predict taken (OK)    — executed
  i = 99:  predict taken (oops)  — executed
  i = 100: fetched only          — its load would read an invalid location
  i = 101: fetched only

Branch Prediction Through Loop

Page 84: Code Optimization

84

Assume vector length = 100.

  i = 98:  predict taken (OK)
  i = 99:  predict taken (oops)
  i = 100: operations invalidated
  i = 101: operations invalidated (only partially fetched)

Branch Misprediction Invalidation

Page 85: Code Optimization

85

Assume vector length = 100.

  i = 98: predict taken (OK)
  i = 99: definitely not taken — execution continues past the loop:

  80488bb: leal 0xffffffe8(%ebp),%esp
  80488be: popl %ebx
  80488bf: popl %esi
  80488c0: popl %edi

Branch Misprediction Recovery

Page 86: Code Optimization

86

Branch Misprediction Recovery P427

• Performance Cost

  – Misprediction on Pentium III wastes ~14 clock cycles

  – That's a lot of time on a high-performance processor

Page 87: Code Optimization

87

• Misprediction penalty is about 14 cycles on a PIII machine

• A conditional move is used to avoid the misprediction penalty when the branch outcome is not predictable

• For example:

int absval(int val)
{
    return (val < 0) ? -val : val;
}

Conditional Jump Figure 5.29 P427

Page 88: Code Optimization

88

Conditional Jump P428

  movl 8(%ebp), %eax     Get val as result
  movl %eax, %edx        Copy to %edx
  negl %edx              Negate %edx
  testl %eax, %eax       Test val
  cmovl %edx, %eax       If < 0, copy %edx to result

Page 89: Code Optimization

89

5.13 Understanding Memory Performance

Page 90: Code Optimization

90

typedef struct ELE {

struct ELE *next ;

int data ;

} list_ele, *list_ptr ;

int list_len(list_ptr ls)

{

int len = 0 ;

for (;ls;ls=ls->next)

len++ ;

return len ;

}

Assembly instructions:
  .L27:
    incl %eax
    movl (%edx), %edx
    testl %edx, %edx
    jne .L27

Execution unit operations:
  incl %eax.0           → %eax.1
  load (%edx.0)         → %edx.1
  testl %edx.1, %edx.1  → cc.1
  jne-taken cc.1

Load Latency P429, P430

Figure 5.30 P430

Page 91: Code Optimization

91

Figure 5.31 P430
[Timing diagram, iterations i = 1..3: each load must wait for the previous load to deliver the next pointer in %edx, so the pointer-chasing loads form a serial chain.]

Page 92: Code Optimization

92

Store Latency Figure 5.32 P431

void array_clear(int *dest, int n)

{

int i;

for ( i = 0 ; i < n ; i++)

dest[i] = 0 ;

}

CPE 2.0

Page 93: Code Optimization

93

Store Latency Figure 5.32 P431

void array_clear(int *dest, int n)
{
    int i;
    int len = n - 7;

    for (i = 0; i < len; i += 8) {
        dest[i]   = dest[i+1] = dest[i+2] = dest[i+3] = 0;
        dest[i+4] = dest[i+5] = dest[i+6] = dest[i+7] = 0;
    }
    for (; i < n; i++)
        dest[i] = 0;
}

CPE 1.25

Page 94: Code Optimization

94

Store latency Figure 5.33 P432

void write_read(int *src, int *dest, int n)

{

int cnt = n;

int val = 0;

while (cnt--) {

*dest = val;

val = (*src)+1;

}

}

Page 95: Code Optimization

95

Store latency Figure 5.33 P432

write_read(&a[0], &a[1], 3)
            initial     iter. 1    iter. 2    iter. 3
  cnt       3           2          1          0
  a         (-10, 17)   (-10, 0)   (-10, -9)  (-10, -9)
  val       0           -9         -9         -9

write_read(&a[0], &a[0], 3)
            initial     iter. 1    iter. 2    iter. 3
  cnt       3           2          1          0
  a         (-10, 17)   (0, 17)    (1, 17)    (2, 17)
  val       0           1          2          3
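The difference is a write/read dependency: in the second call src and dest refer to the same location, so each load of *src must wait for the preceding store to *dest to complete and the iterations are serialized. In the first call the load and store addresses differ, so the operations can overlap.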

Page 96: Code Optimization

96

Store latency

void write_read(int *src, int *dest, int n)

{

int cnt = n;

int val = 0;

while (cnt--) {

*dest = val;

val = (*src)+1;

}

}

Page 97: Code Optimization

97

Store latency P434

  .L32:
    movl %edx, (%ecx)
    movl (%ebx), %edx
    incl %edx
    decl %eax
    jnc .L32

Execution unit operations:
  storeaddr (%ecx)
  storedata %edx.0
  load (%ebx)       → %edx.1a
  incl %edx.1a      → %edx.1b
  decl %eax.0       → %eax.1
  jnc-taken cc.1

Page 98: Code Optimization

98

Store latency Figure 5.35 P434
[Timing diagram of two iterations when src and dest are different locations: the load does not depend on the store, so the operations of successive iterations can overlap.]

Page 99: Code Optimization

99

Figure 5.36 P435
[Timing diagram when src and dest are the same location: each load must wait for the preceding storedata result, serializing the iterations.]

Page 100: Code Optimization

100

5.14 Life in the Real World: Performance Improvement Techniques

Page 101: Code Optimization

101

Basic Strategies for Optimizing Program Performance

• High-level design

• Basic coding principles
  – Eliminate excessive function calls
  – Eliminate unnecessary memory references

• Low-level optimizations
  – Try various forms of pointer versus array code
  – Reduce loop overhead by unrolling loops
  – Find ways to make use of the pipelined functional units by techniques such as iteration splitting

Page 102: Code Optimization

102

5.15 Identifying and Eliminating Performance Bottlenecks

Page 103: Code Optimization

103

Performance Tuning

• Identify
  – Which is the hottest part of the program
  – Using a very useful method: profiling
    • Instrument the program
    • Run it with typical input data
    • Collect information from the result
    • Analyze the result

  – gprof example
    • $ gcc -O2 -pg prog.c -o prog
    • $ ./prog file.txt   (generates a new file, gmon.out)
    • $ gprof prog        (reads gmon.out)

Page 104: Code Optimization

104

Example

• Task
  – Count word frequencies in a text document
  – Sort the words in descending order of occurrence

• Steps
  – Convert strings to lower case
  – Apply hash function
  – Read words and insert into hash table
    • Mostly list operations
    • Maintain a counter for each unique word
  – Sort results

Page 105: Code Optimization

105

Examples

unix> gcc -O2 -pg prog.c -o prog

unix> ./prog file.txt

unix> gprof prog

% cumulative self self total

time seconds seconds calls ms/call ms/call name

86.60 8.21 8.21 1 8210.00 8210.00 sort_words

5.80 8.76 0.55 946596 0.00 0.00 lower1

4.75 9.21 0.45 946596 0.00 0.00 find_ele_rec

1.27 9.33 0.12 946596 0.00 0.00 h_add


Page 107: Code Optimization

107

4872758 find_ele_rec [5]

0.60 0.01 946596/946596 insert_string [4]

[5] 6.7 0.60 0.01 946596+4872758 find_ele_rec [5]

0.00 0.01 26946/26946 save_string [9]

0.00 0.00 26946/26946 new_ele [11]

4872758 find_ele_rec [5]

Example P439

Page 108: Code Optimization

108

Principle

• Interval counting

– Maintain a counter for each function

• Record the time spent executing this function

– The program is interrupted at regular intervals (every 1 ms)

  • Check which function is executing when the interrupt occurs

• Increment the counter for this function

Page 109: Code Optimization

109

Data Set P439

• Collected works of Shakespeare
• 946,596 total words, 26,596 unique
• Initial implementation: 9.2 seconds

Page 110: Code Optimization

110

Code Optimizations

– First step: use a more efficient sorting function
– Library function qsort

Figure 5.37 P441
[Bar chart of CPU seconds after each optimization step: 1) Initial, 2) Quicksort, 3) Iter First, 4) Iter Last, 5) Big Table, 6) Better Hash, 7) Linear Lower; each bar is broken down into Sort, List, Lower, Hash, and Rest time.]

Page 111: Code Optimization

111

Further Optimizations

[The same chart, rescaled to 0–2 CPU seconds to make the steps after the initial quicksort change visible.]

Page 112: Code Optimization

112

Example

• 3) Iter first: use an iterative function to insert elements into the linked list
  – Causes the code to slow down

• 4) Iter last: iterative function that places each new entry at the end of the list
  – Tends to place the most common words at the front of the list

• 5) Big table: Increase number of hash buckets

• 6) Better hash: Use more sophisticated hash function

• 7) Linear lower: Move strlen out of loop

Page 113: Code Optimization

113

Code Motion Example #2

void lower(char *s)
{
    int i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

The compiler effectively sees:

void lower(char *s)
{
    int i = 0;
    if (i >= strlen(s))
        goto done;
  loop:
    if (s[i] >= 'A' && s[i] <= 'Z')
        s[i] -= ('A' - 'a');
    i++;
    if (i < strlen(s))
        goto loop;
  done: ;
}
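Because strlen itself scans the whole string on every call, the loop test makes lower take time quadratic in the string length, which is what the measurements on the following slides show.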

Page 114: Code Optimization

114

Lower Case Conversion Performance

– Time quadruples when the string length doubles
– Quadratic performance

[Log-log plot of CPU seconds versus string length (256 to 262144) for lower1.]

Page 115: Code Optimization

115

• Time quadruples when the string length doubles
• Quadratic performance

[The same plot of CPU seconds versus string length for lower1.]

Page 116: Code Optimization

116

• Move the call to strlen outside of the loop
• Its result does not change from one iteration to the next
• A form of code motion

void lower(char *s)
{
    int i;
    int len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

Improving Performance

Page 117: Code Optimization

117

Lower Case Conversion Performance

– Time doubles when the string length doubles
– Linear performance

[Log-log plot of CPU seconds versus string length comparing lower1 and lower2.]

Page 118: Code Optimization

118

• Benefits
  – Helps identify performance bottlenecks
  – Especially useful for a complex system with many components

• Limitations
  – Only shows performance for the data tested
    • E.g., linear lower did not show a big gain, since the words are short
    • A quadratic inefficiency could remain lurking in the code
  – Timing mechanism is fairly crude
    • Only works for programs that run for > 3 seconds

Performance Tuning

Page 119: Code Optimization

119

Let α be the fraction of the original time spent in the part being sped up, and k the speedup of that part:

  Tnew = (1-α)·Told + (α·Told)/k = Told·[(1-α) + α/k]

  S = Told / Tnew = 1 / [(1-α) + α/k]

In the limit k → ∞:

  S∞ = 1 / (1-α)

Amdahl’s Law P443
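A quick worked example: if profiling shows that α = 0.6 of the run time is in one routine and we speed that routine up by k = 3, then S = 1/(0.4 + 0.6/3) = 1/0.6 ≈ 1.67; even an infinite speedup of that routine could never exceed S = 1/0.4 = 2.5.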