More Code Optimization
Jan 16, 2016
Outline
• Memory Performance
• Tuning Performance
• Suggested reading
  – 5.12 ~ 5.14
Load Performance
• load unit can only initiate one load operation every clock cycle (Issue=1.0)
typedef struct ELE {
    struct ELE *next;
    int data;
} list_ele, *list_ptr;

int list_len(list_ptr ls) {
    int len = 0;
    while (ls) {
        len++;
        ls = ls->next;
    }
    return len;
}

len in %eax, ls in %rdi
.L11:
    addl $1, %eax
    movq (%rdi), %rdi
    testq %rdi, %rdi
    jne .L11

Function    CPE
list_len    4.0

load latency: 4.0
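A minimal sketch of how list_len could be exercised. The helper build_list is hypothetical (it is not on the slide) and simply links the nodes of a caller-supplied array. Each iteration has to load ls->next before the next iteration can begin, which is consistent with the CPE matching the load latency.

```c
#include <stddef.h>

typedef struct ELE {
    struct ELE *next;
    int data;
} list_ele, *list_ptr;

/* From the slide: the load of ls->next in one iteration
   feeds the address used by the next iteration. */
int list_len(list_ptr ls) {
    int len = 0;
    while (ls) {
        len++;
        ls = ls->next;
    }
    return len;
}

/* Hypothetical helper: link the n elements of nodes[] into a
   list and return its head (NULL when n <= 0). */
list_ptr build_list(list_ele *nodes, int n) {
    int i;
    if (n <= 0)
        return NULL;
    for (i = 0; i < n; i++) {
        nodes[i].data = i;
        nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : NULL;
    }
    return &nodes[0];
}
```

For example, list_len(build_list(nodes, 5)) walks five nodes, and list_len(NULL) returns 0 without performing any load.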
Store Performance
• store unit can only initiate one store operation every clock cycle (Issue=1.0)

void array_clear(int *dest, int n) {
    int i;
    for (i = 0; i < n; i++)
        dest[i] = 0;
}

Function       CPE
array_clear    2.0
Store Performance
• store unit can only initiate one store operation every clock cycle (Issue=1.0)

void array_clear_4(int *dest, int n) {
    int i;
    int limit = n-3;
    for (i = 0; i < limit; i+=4) {
        dest[i] = 0;
        dest[i+1] = 0;
        dest[i+2] = 0;
        dest[i+3] = 0;
    }
    for ( ; i < n; i++)
        dest[i] = 0;
}

Function         CPE
array_clear_4    1.0
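The unrolled loop is only correct if the clean-up loop handles lengths that are not a multiple of 4. The sketch below repeats array_clear_4 from the slide and adds a hypothetical checker, leftover_after_clear (not part of the original), that counts elements left nonzero.

```c
/* 4-way unrolled clear, as on the slide. */
void array_clear_4(int *dest, int n) {
    int i;
    int limit = n - 3;
    for (i = 0; i < limit; i += 4) {
        dest[i] = 0;
        dest[i + 1] = 0;
        dest[i + 2] = 0;
        dest[i + 3] = 0;
    }
    /* Finish the 0 to 3 elements the unrolled loop did not reach. */
    for (; i < n; i++)
        dest[i] = 0;
}

/* Hypothetical check: fill buf[0..n-1] with nonzero values,
   clear it, and return how many elements are still nonzero. */
int leftover_after_clear(int *buf, int n) {
    int i;
    int leftover = 0;
    for (i = 0; i < n; i++)
        buf[i] = i + 1;
    array_clear_4(buf, n);
    for (i = 0; i < n; i++)
        if (buf[i] != 0)
            leftover++;
    return leftover;
}
```

When n is less than 4, limit is zero or negative, so the unrolled loop is skipped and the second loop does all the work.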
Store Performance
void write_read(int *src, int *dest, int n)
{
    int cnt = n;
    int val = 0;
    while (cnt--) {
        *dest = val;
        val = (*src)+1;
    }
}

Example A: write_read(&a[0],&a[1],3)

        initial   iter1   iter2   iter3
cnt     3         2       1       0
a[0]    -10       -10     -10     -10
a[1]    17        0       -9      -9
val     0         -9      -9      -9

Example B: write_read(&a[0],&a[0],3)

        initial   iter1   iter2   iter3
cnt     3         2       1       0
a[0]    -10       0       1       2
a[1]    17        17      17      17
val     0         1       2       3

Function     CPE
Example A    2.0
Example B    6.0
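The two examples can be reproduced with a short sketch. The wrappers run_example_a and run_example_b are hypothetical names introduced here; they call write_read with the argument patterns shown above on a two-element array.

```c
/* From the slide: store val to *dest, then reload *src. */
void write_read(int *src, int *dest, int n) {
    int cnt = n;
    int val = 0;
    while (cnt--) {
        *dest = val;
        val = (*src) + 1;
    }
}

/* Example A: src and dest are different locations, so each
   load of *src is independent of the preceding store. */
void run_example_a(int a[2]) {
    write_read(&a[0], &a[1], 3);
}

/* Example B: src and dest are the same location, so each
   load must see the value written by the preceding store. */
void run_example_b(int a[2]) {
    write_read(&a[0], &a[0], 3);
}
```

Starting from a = {-10, 17}, Example A ends with a = {-10, -9} and Example B ends with a = {2, 17}, matching the final columns of the two traces.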
Load and Store Units
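[Figure: the Load Unit and the Store Unit connected to the Data Cache through address and data lines. The Store Unit contains a store buffer of (address, data) entries, and load addresses are compared against it for matching addresses.]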
Graphical Representation
[Figure: data-flow graph of the inner loop over registers %eax, %ebx, %ecx, %edx, with operations s_addr, s_data, load, add, sub, jne]

//inner-loop
while (cnt--) {
    *dest = val;
    val = (*src)+1;
}

movl %eax,(%ecx)
movl (%ebx), %eax
addl $1,%eax
subl $1,%edx
jne loop
Graphical Representation
[Figure: rearranged and simplified data-flow graphs of the inner loop, with operations s_addr, s_data, load, add, sub, jg; three dependencies are labeled 1, 2, 3]
Graphical Representation
[Figure: replicated data-flow graphs for Example A and Example B, with the critical path marked]

Function     CPE
Example A    2.0
Example B    6.0
Getting High Performance
• High-level design
  – Choose appropriate algorithms and data structures for the problem at hand
  – Be especially vigilant to avoid algorithms or coding techniques that yield asymptotically poor performance
Getting High Performance
• Basic coding principles
  – Avoid optimization blockers so that a compiler can generate efficient code.
  – Eliminate excessive function calls
    • Move computations out of loops when possible
    • Consider selective compromises of program modularity to gain greater efficiency
  – Eliminate unnecessary memory references.
    • Introduce temporary variables to hold intermediate results
    • Store a result in an array or global variable only when the final value has been computed.
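As an illustration of the last principle, the sketch below contrasts two hypothetical summing functions (sum_mem and sum_tmp are names chosen here, not taken from the slides). The first accumulates through the pointer dest; the second accumulates in a local temporary and stores once.

```c
/* Accumulates directly in *dest. Because dest might alias an
   element of a[], a compiler generally has to read and write
   *dest in memory on every iteration. */
void sum_mem(const int *a, int n, int *dest) {
    int i;
    *dest = 0;
    for (i = 0; i < n; i++)
        *dest += a[i];
}

/* Accumulates in a local temporary, which can live in a
   register, and stores the result only when it is final. */
void sum_tmp(const int *a, int n, int *dest) {
    int i;
    int acc = 0;
    for (i = 0; i < n; i++)
        acc += a[i];
    *dest = acc;
}
```

The two versions agree whenever dest does not point into a[]. If it does, they can produce different results, which is exactly why the compiler cannot make this transformation on its own.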
Getting High Performance
• Low-level optimizations
  – Unroll loops to reduce overhead and to enable further optimizations
  – Find ways to increase instruction-level parallelism by techniques such as multiple accumulators and reassociation
  – Rewrite conditional operations in a functional style to enable compilation via conditional data transfers
  – Write cache friendly code
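A minimal sketch of loop unrolling combined with multiple accumulators, assuming an integer sum as the workload. The function name sum_2x2 is hypothetical. Two independent accumulators let the additions of consecutive elements proceed without waiting on each other.

```c
/* Unroll by 2 with 2 accumulators: acc0 takes the even-indexed
   elements, acc1 the odd-indexed ones. */
long sum_2x2(const int *a, int n) {
    int i;
    long acc0 = 0;
    long acc1 = 0;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += a[i];
        acc1 += a[i + 1];
    }
    /* Handle the last element when n is odd. */
    for (; i < n; i++)
        acc0 += a[i];
    return acc0 + acc1;
}
```

For integer addition the regrouping does not change the result. For floating-point data the same transformation can change rounding, so it should be applied with that in mind.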
Performance Tuning
• Identify
  – Which is the hottest part of the program
  – Using a very useful method: profiling
    • Instrument the program
    • Run it with typical input data
    • Collect information from the result
    • Analyze the result
Examples
unix> gcc -O1 -pg prog.c -o prog
unix> ./prog file.txt
unix> gprof prog

  %    cumulative   self                self     total
 time   seconds    seconds     calls   s/call   s/call  name
97.58    173.05     173.05         1   173.05   173.05  sort_words
 2.36    177.24       4.19    965027     0.00     0.00  find_ele_rec
 0.12    177.46       0.22  12511031     0.00     0.00  Strlen
Principle
• Interval counting
  – Maintain a counter for each function
    • Record the time spent executing this function
  – Interrupted at regular intervals (1 ms)
    • Check which function is executing when the interrupt occurs
    • Increment the counter for this function
• The calling information is quite reliable
• By default, the timings for library functions are not shown
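The interval-counting idea can be illustrated with a small simulation. This is not how gprof is implemented; it is a sketch under the assumption that execution is given as a list of segments (which function ran, and for how long), with a simulated timer tick every period_us microseconds. The names run_segment and interval_count are hypothetical.

```c
/* One stretch of execution: which function ran, and for how
   many microseconds. */
typedef struct {
    int func_id;
    long duration_us;
} run_segment;

/* Simulate a timer interrupt every period_us microseconds and
   charge each tick to the function running at that moment.
   counts[] needs one slot per function id and must be zeroed
   by the caller. */
void interval_count(const run_segment *trace, int nseg,
                    long period_us, long *counts) {
    long seg_start = 0;
    long next_tick = period_us;
    int i;
    for (i = 0; i < nseg; i++) {
        long seg_end = seg_start + trace[i].duration_us;
        /* Every tick that falls inside this segment is
           credited to the segment's function. */
        while (next_tick <= seg_end) {
            counts[trace[i].func_id]++;
            next_tick += period_us;
        }
        seg_start = seg_end;
    }
}
```

With a 1000 us period and the trace {function 0 for 2500 us, function 1 for 400 us, function 0 for 3100 us}, all six ticks land in function 0 and function 1 is never charged, which shows why the timing mechanism is crude for short-running code.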
Program Example
• Task
  – Analyzing the n-gram statistics of a text document
  – an n-gram is a sequence of n words occurring in a document
  – reads a text file,
  – creates a table of unique n-grams
    – specifying how many times each one occurs
  – sorts the n-grams in descending order of occurrence
Program Example
• Steps
  – Convert strings to lowercase
  – Apply hash function
  – Read n-grams and insert into hash table
    • Mostly list operations
    • Maintain counter for each unique n-gram
  – Sort results
• Data Set
  • Collected works of Shakespeare
  • 965,028 total words, 23,706 unique
  • N=2, called bigrams
  • 363,039 unique bigrams
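The first three steps can be sketched as follows. Everything here is an assumption made for illustration: the table size NBUCKETS, the character-sum hash, the h_rec layout, and the body of insert_string (the name appears in the profile output, but its real implementation is not shown in these slides). Each string is treated as one key; for bigrams the key would be two words joined together.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 8   /* assumed table size, tiny for illustration */

typedef struct HELE {
    char *word;
    int freq;
    struct HELE *next;
} h_rec, *h_ptr;

/* Step 1: convert a string to lowercase in place. */
void lower(char *s) {
    size_t i;
    size_t len = strlen(s);
    for (i = 0; i < len; i++)
        s[i] = (char)tolower((unsigned char)s[i]);
}

/* Step 2: assumed simple hash, the sum of the character codes
   modulo the table size. */
unsigned hash(const char *s) {
    unsigned sum = 0;
    while (*s)
        sum += (unsigned char)*s++;
    return sum % NBUCKETS;
}

/* Step 3: find s in its bucket and bump its counter, or add a
   new element. Returns the updated count, or -1 if allocation
   fails. */
int insert_string(h_ptr table[], char *s) {
    h_ptr p;
    unsigned b;
    lower(s);
    b = hash(s);
    for (p = table[b]; p; p = p->next)
        if (strcmp(p->word, s) == 0)
            return ++p->freq;
    p = malloc(sizeof *p);
    if (!p)
        return -1;
    p->word = malloc(strlen(s) + 1);
    if (!p->word) {
        free(p);
        return -1;
    }
    strcpy(p->word, s);
    p->freq = 1;
    p->next = table[b];   /* new element goes at the front */
    table[b] = p;
    return 1;
}
```

Inserting "The" and then "the" into an empty table yields counts 1 and 2, because both strings are lowercased before hashing and comparison.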
Example

                               158655725             find_ele_rec [5]
              4.19    0.02     965027/965027         insert_string [4]
[5]    2.4    4.19    0.02     965027+158655725      find_ele_rec [5]
              0.01    0.01     363039/363039         new_ele [10]
              0.00    0.01     363039/363039         save_string [13]
                               158655725             find_ele_rec [5]

• Ratio: 158655725/965027 = 164.4
• The average length of a list in one hash bucket is 164
Code Optimizations
– First step: Use more efficient sorting function
– Library function qsort
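A minimal sketch of sorting with the standard library qsort, assuming the n-gram counts have been copied into an array. The ngram_count type and the names by_freq_desc and sort_counts are hypothetical; only qsort itself is a library function.

```c
#include <stdlib.h>

/* Assumed record: one n-gram and the number of times it occurs. */
typedef struct {
    const char *text;
    int freq;
} ngram_count;

/* Comparison function for qsort: orders by descending freq.
   Returns a negative value when x should come before y. */
static int by_freq_desc(const void *a, const void *b) {
    const ngram_count *x = a;
    const ngram_count *y = b;
    return (x->freq < y->freq) - (x->freq > y->freq);
}

/* Sort n records so the most frequent n-gram comes first. */
void sort_counts(ngram_count *items, size_t n) {
    qsort(items, n, sizeof items[0], by_freq_desc);
}
```

The comparison uses two relational tests instead of subtracting the counts, which avoids integer overflow for extreme values.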
Further Optimizations
Optimizations
• Iter first: Use iterative function to insert elements in linked list
  – Causes code to slow down
• Iter last: Iterative function, places new entry at end of list
  – Tends to place most common words at front of list
• Big table: Increase number of hash buckets
• Better hash: Use more sophisticated hash function
• Linear lower: Move strlen out of loop
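The "Iter last" variant can be sketched as below. The wnode type and the function name insert_iter_last are hypothetical, and this is one possible way to write it rather than the code that was profiled. The list is walked with a loop instead of recursion, and a new entry is linked in at the tail, so words seen early (which tend to be common) stay near the front.

```c
#include <stdlib.h>
#include <string.h>

typedef struct WNODE {
    char *word;
    int freq;
    struct WNODE *next;
} wnode;

/* Iteratively search the list at *head for word. If found,
   increment its count; otherwise append a new node at the end.
   Returns the updated count, or -1 if allocation fails. */
int insert_iter_last(wnode **head, const char *word) {
    wnode **link = head;   /* points at the link to be examined */
    wnode *e;
    while (*link) {
        if (strcmp((*link)->word, word) == 0)
            return ++(*link)->freq;
        link = &(*link)->next;
    }
    e = malloc(sizeof *e);
    if (!e)
        return -1;
    e->word = malloc(strlen(word) + 1);
    if (!e->word) {
        free(e);
        return -1;
    }
    strcpy(e->word, word);
    e->freq = 1;
    e->next = NULL;
    *link = e;             /* link is now the tail's next field */
    return 1;
}
```

Using a pointer to the link avoids a special case for the empty list: when the loop ends, link refers either to *head or to the next field of the last node.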
Code Motion
/* Convert string to lowercase: slow */
void lower1(char *s)
{
    int i;

    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
Code Motion
/* Convert string to lowercase: faster */
void lower2(char *s)
{
    int i;
    int len = strlen(s);

    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
Code Motion
/* Sample implementation of library function strlen */
/* Compute length of string */
size_t strlen(const char *s)
{
    int length = 0;
    while (*s != '\0') {
        s++;
        length++;
    }
    return length;
}
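To make the cost difference concrete, the sketch below counts how many characters the length computation examines in each version. The counter chars_scanned and the functions counted_strlen, lower1_cost and lower2_cost are hypothetical instrumentation added for this illustration; the loops themselves follow lower1 and lower2.

```c
#include <stddef.h>

/* Total characters examined by counted_strlen. */
static unsigned long chars_scanned;

/* Same loop as the sample strlen, plus a counter. */
static size_t counted_strlen(const char *s) {
    size_t length = 0;
    while (*s != '\0') {
        s++;
        length++;
        chars_scanned++;
    }
    return length;
}

/* lower1 style: the length is recomputed on every loop test. */
unsigned long lower1_cost(char *s) {
    size_t i;
    chars_scanned = 0;
    for (i = 0; i < counted_strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
    return chars_scanned;
}

/* lower2 style: the length is computed once, before the loop. */
unsigned long lower2_cost(char *s) {
    size_t i;
    size_t len;
    chars_scanned = 0;
    len = counted_strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
    return chars_scanned;
}
```

For a string of length n, the lower1 style evaluates the length n+1 times and so scans n(n+1) characters, while the lower2 style scans n. With "ABCD" that is 20 versus 4.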
Code Motion
Performance Tuning
• Benefits
  – Helps identify performance bottlenecks
  – Especially useful when you have a complex system with many components
• Limitations
  – Only shows performance for data tested
  – E.g., linear lower did not show big gain, since words are short
    • Quadratic inefficiency could remain lurking in code
  – Timing mechanism fairly crude
    • Only works for programs that run for > 3 seconds
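Amdahl’s Law
$T_{new} = (1-\alpha)T_{old} + (\alpha T_{old})/k$
$\phantom{T_{new}} = T_{old}[(1-\alpha) + \alpha/k]$
$S = T_{old}/T_{new} = 1/[(1-\alpha) + \alpha/k]$
$S_{\infty} = 1/(1-\alpha)$
Here $\alpha$ is the fraction of the original time taken by the part being improved, $k$ is the factor by which that part is sped up, and $S_{\infty}$ is the limit of $S$ as $k$ grows without bound.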
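The formulas translate directly into code. In this sketch the function names amdahl_speedup and amdahl_limit are hypothetical; alpha is the fraction of time taken by the part being improved and k is the speedup factor of that part.

```c
/* Overall speedup S = 1 / ((1 - alpha) + alpha / k).
   Assumes 0 <= alpha <= 1 and k > 0. */
double amdahl_speedup(double alpha, double k) {
    return 1.0 / ((1.0 - alpha) + alpha / k);
}

/* Limit of the speedup as k grows without bound: 1 / (1 - alpha).
   Assumes alpha < 1. */
double amdahl_limit(double alpha) {
    return 1.0 / (1.0 - alpha);
}
```

For example, speeding up a part that takes 60% of the time by a factor of 3 gives 1/(0.4 + 0.2), about 1.67 overall, and no speedup of that part alone can push the overall figure past 1/0.4 = 2.5.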