More Code Optimization


Outline

• Memory Performance
• Tuning Performance
• Suggested reading
  – 5.12 ~ 5.14


Load Performance

• The load unit can initiate only one load operation every clock cycle (Issue = 1.0)

    typedef struct ELE {
        struct ELE *next;
        int data;
    } list_ele, *list_ptr;

    int list_len(list_ptr ls) {
        int len = 0;
        while (ls) {
            len++;
            ls = ls->next;
        }
        return len;
    }

    # len in %eax, ls in %rdi
    .L11:
        addl  $1, %eax
        movq  (%rdi), %rdi
        testq %rdi, %rdi
        jne   .L11

    Function    CPE
    list_len    4.0

    Load latency: 4.0


Store Performance

• The store unit can initiate only one store operation every clock cycle (Issue = 1.0)

    void array_clear(int *dest, int n) {
        int i;
        for (i = 0; i < n; i++)
            dest[i] = 0;
    }

    Function       CPE
    array_clear    2.0


Store Performance

• The store unit can initiate only one store operation every clock cycle (Issue = 1.0)

    void array_clear_4(int *dest, int n) {
        int i;
        int limit = n-3;
        for (i = 0; i < limit; i+=4) {
            dest[i] = 0;
            dest[i+1] = 0;
            dest[i+2] = 0;
            dest[i+3] = 0;
        }
        for ( ; i < n; i++)
            dest[i] = 0;
    }

    Function         CPE
    array_clear_4    1.0


Store Performance

    void write_read(int *src, int *dest, int n)
    {
        int cnt = n;
        int val = 0;

        while (cnt--) {
            *dest = val;
            val = (*src)+1;
        }
    }

Example A: write_read(&a[0],&a[1],3)

               initial    iter1    iter2    iter3
    cnt        3          2        1        0
    a[0]       -10        -10      -10      -10
    a[1]       17         0        -9       -9
    val        0          -9       -9       -9

Example B: write_read(&a[0],&a[0],3)

               initial    iter1    iter2    iter3
    cnt        3          2        1        0
    a[0]       -10        0        1        2
    a[1]       17         17       17       17
    val        0          1        2        3

    Function     CPE
    Example A    2.0
    Example B    6.0
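The two traces above can be checked with a small sketch. Only write_read comes from the slide; the helper run_example and its parameter names are assumptions made for illustration.

```c
/* write_read as given on the slide */
void write_read(int *src, int *dest, int n)
{
    int cnt = n;
    int val = 0;

    while (cnt--) {
        *dest = val;
        val = (*src) + 1;
    }
}

/* Hypothetical helper (not from the slides): runs write_read on a fresh
 * array a = {-10, 17}, reading from a[src_idx] and writing to a[dest_idx],
 * then copies the final array contents into out[0..1]. */
void run_example(int src_idx, int dest_idx, int n, int out[2])
{
    int a[2] = { -10, 17 };

    write_read(&a[src_idx], &a[dest_idx], n);
    out[0] = a[0];
    out[1] = a[1];
}
```

run_example(0, 1, 3, out) corresponds to Example A and leaves {-10, -9}; run_example(0, 0, 3, out) corresponds to Example B and leaves {2, 17}. In Example B every load reads the location written by the store just before it, so each iteration depends on the previous one through memory, which is consistent with the higher CPE in the table.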


Load and Store Units

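[Figure: the load unit and the store unit both exchange addresses and data with the data cache. The store unit contains a store buffer holding the address and data of pending stores; the address of each load is checked against the store buffer for matching addresses.]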


Graphical Representation

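    //inner-loop
    while (cnt--) {
        *dest = val;
        val = (*src)+1;
    }

    Instruction           Operations
    movl %eax,(%ecx)      s_addr, s_data
    movl (%ebx),%eax      load
    addl $1,%eax          add
    subl $1,%edx          sub
    jne loop              jne

[Figure: graphical representation of the inner loop. The registers %eax, %ebx, %ecx, and %edx enter at the top and leave at the bottom, connected through the operations s_addr, s_data, load, add, sub, and jne.]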

[Figure: data-flow graphs of the inner loop for Example A and Example B, built from the operations s_addr, s_data, load, add, and sub and repeated over successive iterations, with the critical path marked for each example.]

    Function     CPE
    Example A    2.0
    Example B    6.0

Getting High Performance

• High-level design
  – Choose appropriate algorithms and data structures for the problem at hand
  – Be especially vigilant to avoid algorithms or coding techniques that yield asymptotically poor performance
• Basic coding principles
  – Avoid optimization blockers so that a compiler can generate efficient code
  – Eliminate excessive function calls
    • Move computations out of loops when possible
    • Consider selective compromises of program modularity to gain greater efficiency
  – Eliminate unnecessary memory references
    • Introduce temporary variables to hold intermediate results
    • Store a result in an array or global variable only when the final value has been computed
• Low-level optimizations
  – Unroll loops to reduce overhead and to enable further optimizations
  – Find ways to increase instruction-level parallelism by techniques such as multiple accumulators and reassociation
  – Rewrite conditional operations in a functional style to enable compilation via conditional data transfers
  – Write cache-friendly code
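A minimal sketch of two of the principles above: holding intermediate results in temporaries, and unrolling with multiple accumulators. The function names sum_mem and sum_unrolled are hypothetical and do not come from the slides.

```c
/* Baseline: reads and writes *dest in memory on every iteration. */
void sum_mem(const int *a, int n, int *dest)
{
    int i;

    *dest = 0;
    for (i = 0; i < n; i++)
        *dest += a[i];
}

/* Same result when dest does not alias a: the running sums live in two
 * local accumulators, the loop is unrolled by 2, and memory is written
 * only once, after the final value has been computed. */
void sum_unrolled(const int *a, int n, int *dest)
{
    int i;
    int acc0 = 0;
    int acc1 = 0;
    int limit = n - 1;

    /* Two independent additions per iteration */
    for (i = 0; i < limit; i += 2) {
        acc0 += a[i];
        acc1 += a[i + 1];
    }
    /* Finish any remaining element */
    for (; i < n; i++)
        acc0 += a[i];

    *dest = acc0 + acc1;
}
```

The two versions can differ when dest points into a, which is why a compiler cannot safely make this transformation on its own: possible aliasing is an optimization blocker.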


Performance Tuning

• Identify which part of the program is the hottest
  – Use profiling, a very useful method
    • Instrument the program
    • Run it with typical input data
    • Collect information from the result
    • Analyze the result


Examples

    unix> gcc -O1 -pg prog.c -o prog
    unix> ./prog file.txt
    unix> gprof prog

      %   cumulative    self                self     total
     time    seconds   seconds     calls   s/call   s/call  name
    97.58     173.05    173.05         1   173.05   173.05  sort_words
     2.36     177.24      4.19    965027     0.00     0.00  find_ele_rec
     0.12     177.46      0.22  12511031     0.00     0.00  Strlen


Principle

• Interval counting
  – Maintain a counter for each function
    • Record the time spent executing this function
  – Interrupted at regular time intervals (1 ms)
    • Check which function is executing when the interrupt occurs
    • Increment the counter for this function
• The calling information is quite reliable
• By default, the timings for library functions are not shown
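The interval-counting scheme above can be illustrated with a deterministic simulation. This is only a sketch under stated assumptions: struct segment, NFUNCS, and interval_count are hypothetical names, the timeline is supplied by the caller instead of coming from a real timer interrupt, and this is not how gprof itself is implemented.

```c
#define NFUNCS 3   /* number of functions tracked (assumption) */

/* One entry of a hypothetical execution timeline:
 * function number func runs for duration_us microseconds. */
struct segment {
    int  func;
    long duration_us;
};

/* Simulate interval counting: a "timer interrupt" fires every period_us
 * microseconds, and the counter of whichever function is executing at
 * that instant is incremented. A tick that falls exactly on a segment
 * boundary is charged to the segment that is ending. */
void interval_count(const struct segment *timeline, int nseg,
                    long period_us, long counts[NFUNCS])
{
    long seg_start = 0;
    long next_tick = period_us;
    int i;

    for (i = 0; i < NFUNCS; i++)
        counts[i] = 0;

    for (i = 0; i < nseg; i++) {
        long seg_end = seg_start + timeline[i].duration_us;

        while (next_tick <= seg_end) {
            counts[timeline[i].func]++;
            next_tick += period_us;
        }
        seg_start = seg_end;
    }
}
```

With a 1000 us period and a timeline of 2500 us in function 0, 400 us in function 1, and 3100 us in function 2, the counters end up as 2, 0, and 4. Function 1 ran but is never sampled, which shows why the resulting timings are only estimates.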


Program Example

• Task
  – Analyze the n-gram statistics of a text document
  – An n-gram is a sequence of n words occurring in a document
  – Reads a text file
  – Creates a table of unique n-grams, specifying how many times each one occurs
  – Sorts the n-grams in descending order of occurrence


Program Example

• Steps
  – Convert strings to lowercase
  – Apply hash function
  – Read n-grams and insert into hash table
    • Mostly list operations
    • Maintain counter for each unique n-gram
  – Sort results
• Data Set
  – Collected works of Shakespeare
  – 965,028 total words, 23,706 unique
  – N=2, called bigrams
  – 363,039 unique bigrams


Example

                               158655725             find_ele_rec [5]
               4.19    0.02    965027/965027         insert_string [4]
    [5]  2.4   4.19    0.02    965027+158655725      find_ele_rec [5]
               0.01    0.01    363039/363039         new_ele [10]
               0.00    0.01    363039/363039         save_string [13]
                               158655725             find_ele_rec [5]

• Ratio: 158655725/965027 = 164.4
• The average length of a list in one hash bucket is 164
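The call counts in this profile can be related to list length with a small sketch of a recursive lookup-or-insert routine. Only the name find_ele_rec appears in the profile; the struct layout, the rec_calls counter, and the function body are assumptions made for illustration, not the profiled program's source.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical list element for one hash bucket */
typedef struct HELE {
    char *word;
    int freq;
    struct HELE *next;
} h_rec, *h_ptr;

/* Counts every invocation, top-level and recursive alike */
long rec_calls = 0;

/* Recursively search the list for s. If found, increment its count;
 * otherwise append a new element at the end of the list.
 * Returns the (possibly new) head of the list. */
h_ptr find_ele_rec(h_ptr ls, const char *s)
{
    rec_calls++;
    if (!ls) {
        /* Reached the end without a match: create a new element */
        h_ptr ele = malloc(sizeof(h_rec));
        if (!ele)
            abort();
        ele->word = malloc(strlen(s) + 1);
        if (!ele->word)
            abort();
        strcpy(ele->word, s);
        ele->freq = 1;
        ele->next = NULL;
        return ele;
    }
    if (strcmp(ls->word, s) == 0) {
        ls->freq++;
        return ls;
    }
    /* One recursive call per list element that is skipped */
    ls->next = find_ele_rec(ls->next, s);
    return ls;
}
```

Each top-level call makes one recursive call for every element it steps past, so the ratio of recursive calls to top-level calls estimates how far down a list a typical lookup travels. In the profile above, 965027 is the number of top-level calls and 158655725 the number of recursive calls.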


Code Optimizations

• First step: use a more efficient sorting function
  – Library function qsort
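A minimal sketch of sorting n-gram counts in descending order with the standard library function qsort. The struct ngram_count, the comparator by_freq_desc, and the wrapper sort_ngrams are hypothetical names; the slides do not show the program's actual data layout.

```c
#include <stdlib.h>

/* Hypothetical record: one unique n-gram and its occurrence count */
struct ngram_count {
    const char *text;
    int freq;
};

/* Comparator for qsort: negative when a should come before b,
 * i.e. when a occurs more often (descending order of frequency). */
static int by_freq_desc(const void *pa, const void *pb)
{
    const struct ngram_count *a = pa;
    const struct ngram_count *b = pb;

    return (a->freq < b->freq) - (a->freq > b->freq);
}

/* Sort n records in place, most frequent first */
void sort_ngrams(struct ngram_count *items, size_t n)
{
    qsort(items, n, sizeof items[0], by_freq_desc);
}
```

The comparator returns the difference of two comparisons instead of subtracting the counts, which avoids integer overflow for extreme values.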


Further Optimizations


Optimizations

• Iter first: Use iterative function to insert elements in linked list
  – Causes code to slow down
• Iter last: Iterative function, places new entry at end of list
  – Tends to place most common words at front of list
• Big table: Increase number of hash buckets
• Better hash: Use more sophisticated hash function
• Linear lower: Move strlen out of loop


Code Motion

    /* Convert string to lowercase: slow */
    void lower1(char *s)
    {
        int i;

        /* strlen(s) is re-evaluated on every loop test */
        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }


Code Motion

    /* Convert string to lowercase: faster */
    void lower2(char *s)
    {
        int i;
        int len = strlen(s);    /* computed once, outside the loop */

        for (i = 0; i < len; i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }


Code Motion

    /* Sample implementation of library function strlen */
    /* Compute length of string */
    size_t strlen(const char *s)
    {
        size_t length = 0;    /* matches the return type */
        while (*s != '\0') {
            s++;
            length++;
        }
        return length;
    }
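The cost difference between the two lowercase routines can be made visible by counting how many characters the length computation scans. This is a sketch for illustration: counted_strlen, lower1_counted, lower2_counted, and the chars_scanned counter are hypothetical instrumented variants, not code from the slides.

```c
#include <stddef.h>

/* Total number of characters examined by counted_strlen so far */
long chars_scanned = 0;

/* Same loop as the sample strlen, plus a scan counter */
size_t counted_strlen(const char *s)
{
    size_t length = 0;

    while (*s != '\0') {
        s++;
        length++;
        chars_scanned++;
    }
    return length;
}

/* lower1 pattern: the length is recomputed on every loop test */
void lower1_counted(char *s)
{
    size_t i;

    for (i = 0; i < counted_strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* lower2 pattern: the length is computed once, before the loop */
void lower2_counted(char *s)
{
    size_t i;
    size_t len = counted_strlen(s);

    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```

For a string of n characters, the lower1 pattern evaluates the loop test n+1 times and scans n characters each time, n(n+1) in total, while the lower2 pattern scans n characters once. For "HELLO" that is 30 versus 5.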


Code Motion

Performance Tuning

• Benefits
  – Helps identify performance bottlenecks
  – Especially useful for a complex system with many components
• Limitations
  – Only shows performance for the data tested
    • E.g., linear lower did not show a big gain, since words are short
    • Quadratic inefficiency could remain lurking in code
  – Timing mechanism is fairly crude
    • Only works for programs that run for > 3 seconds


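Amdahl's Law

Here $\alpha$ is the fraction of the original execution time taken by the part being improved, and $k$ is the factor by which that part is sped up.

$$T_{new} = (1-\alpha)\,T_{old} + \frac{\alpha\,T_{old}}{k} = T_{old}\left[(1-\alpha) + \frac{\alpha}{k}\right]$$

$$S = \frac{T_{old}}{T_{new}} = \frac{1}{(1-\alpha) + \alpha/k}$$

As $k \to \infty$:

$$S_{\infty} = \frac{1}{1-\alpha}$$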
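A short sketch that evaluates Amdahl's law numerically. The function names amdahl_speedup and amdahl_limit are hypothetical; alpha is taken to be the fraction of the original run time spent in the improved part and k the speedup factor of that part.

```c
/* Overall speedup S = 1 / ((1 - alpha) + alpha / k).
 * Assumes 0 <= alpha <= 1 and k > 0. */
double amdahl_speedup(double alpha, double k)
{
    return 1.0 / ((1.0 - alpha) + alpha / k);
}

/* Upper bound on the speedup as k grows without limit: 1 / (1 - alpha).
 * Assumes alpha < 1. */
double amdahl_limit(double alpha)
{
    return 1.0 / (1.0 - alpha);
}
```

For example, speeding up 60% of a program by a factor of 3 gives amdahl_speedup(0.6, 3.0), about 1.67, and no speedup of that part can push the overall figure past amdahl_limit(0.6), which is 2.5.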
