Parallel Computing Explained
Parallel Performance Analysis

Slides prepared from the CI-Tutor courses at NCSA
http://ci-tutor.ncsa.uiuc.edu/
By S. Masoud Sadjadi
School of Computing and Information Sciences, Florida International University
March 2009
Speedup
• The speedup of your code tells you how much performance gain is achieved by running your program in parallel on multiple processors.
• A simple definition is that speedup is the length of time it takes a program to run on a single processor, divided by the time it takes to run on multiple processors.
• Speedup generally ranges between 0 and p, where p is the number of processors.
Scalability
• When you compute with multiple processors in a parallel environment, you will also want to know how your code scales.
• The scalability of a parallel code is defined as its ability to achieve performance proportional to the number of processors used.
• As you run your code with more and more processors, you want to see the performance of the code continue to improve.
• Computing speedup is a good way to measure how a program scales; a measurement sketch follows below.
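As an illustration, here is a minimal sketch (a hypothetical example, not from the original course materials) that uses OpenMP to time the same kernel on one thread and on p threads, then reports the measured speedup. The kernel sum_series is an assumed stand-in for a real computation:

/* speedup.c - hypothetical sketch of measuring speedup with OpenMP.
   Compile with an OpenMP-capable compiler, e.g.: cc -fopenmp speedup.c */
#include <stdio.h>
#include <omp.h>

#define N 100000000L

/* Assumed stand-in kernel for a real application's computation. */
static double sum_series(int nthreads)
{
    double sum = 0.0;
    #pragma omp parallel for num_threads(nthreads) reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += i * 0.5;
    return sum;
}

int main(void)
{
    int p = omp_get_max_threads();

    double t0 = omp_get_wtime();
    sum_series(1);                      /* T(1): time on one processor */
    double t1 = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    sum_series(p);                      /* T(p): time on p processors  */
    double tp = omp_get_wtime() - t0;

    /* speedup = T(1) / T(p); ideally this is close to p */
    printf("speedup on %d processors: %.2f\n", p, t1 / tp);
    return 0;
}

Running this with successively larger thread counts (for example, OMP_NUM_THREADS set to 2, 4, 8) shows how the code scales.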
Speedup Extremes
• The extremes of speedup happen when speedup is
  - greater than p, called super-linear speedup, or
  - less than 1.
• Super-Linear Speedup
  - You might wonder how super-linear speedup can occur: how can speedup be greater than the number of processors used?
  - The answer usually lies with the program's memory use. When using multiple processors, each processor only gets part of the problem compared to the single-processor case. It is possible that the smaller problem can make better use of the memory hierarchy, that is, the cache and the registers. For example, the smaller problem may fit in cache when the entire problem would not.
  - When super-linear speedup is achieved, it is often an indication that the sequential code, run on one processor, had serious cache miss problems.
  - The most common programs that achieve super-linear speedup are those that solve dense linear algebra problems.
Amdahl's Law
• The interpretation of Amdahl's Law is that speedup is limited by the fact that not all parts of a code can be run in parallel.
• Substituting in the formula below, when the number of processors goes to infinity, your code's speedup is still limited by 1/f.
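Here f is the fraction of the code that must run sequentially and p is the number of processors; the standard form of Amdahl's Law gives the speedup S(p) as:

    S(p) = 1 / ( f + (1 - f) / p )

As p grows, the (1 - f)/p term goes to zero, so S(p) approaches 1/f. For example, if f = 0.1 (10% of the run time is sequential), speedup can never exceed 1/0.1 = 10, no matter how many processors are used.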
• Amdahl's Law shows that the sequential fraction of code has a strong effect on speedup.
• This helps to explain the need for large problem sizes when using parallel computers.
• It is well known in the parallel computing community that you cannot take a small application and expect it to show good performance on a parallel computer.
• To get good performance, you need to run large applications, with large data array sizes and lots of computation.
• The reason for this is that as the problem size increases, the opportunity for parallelism grows and the sequential fraction shrinks, so the sequential part matters less and less for speedup.
Speedup Limitations
• Speedup is limited when the problem size is too small to take best advantage of a parallel computer.
• In addition, speedup is limited when the problem size is fixed.
  - That is, when the problem size doesn't grow as you compute with more processors.
• Too much sequential code
  - Speedup is limited when there's too much sequential code.
  - This is shown by Amdahl's Law.
• Too much parallel overhead
  - Speedup is limited when there is too much parallel overhead compared to the amount of computation.
  - These are the additional CPU cycles accumulated in creating parallel regions, creating threads, synchronizing threads, spin/blocking threads, and ending parallel regions (see the sketch below).
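To make the overhead concrete, the hypothetical fragment below (not from the original course) parallelizes a loop that does almost no work; the cycles spent creating the thread team and synchronizing at the implicit barrier can easily exceed the computation itself:

/* overhead.c - hypothetical sketch: parallel overhead dominating a
   tiny workload. Compile with: cc -fopenmp overhead.c */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    double a[100];

    /* Creating the thread team and synchronizing at the implicit
       barrier costs far more than these 100 multiplications. */
    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < 100; i++)
        a[i] = i * 2.0;
    double t_par = omp_get_wtime() - t0;

    /* The same loop without any parallel overhead. */
    t0 = omp_get_wtime();
    for (int i = 0; i < 100; i++)
        a[i] = i * 2.0;
    double t_seq = omp_get_wtime() - t0;

    printf("parallel: %g s  sequential: %g s  (a[1] = %g)\n",
           t_par, t_seq, a[1]);
    return 0;
}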
• Load imbalance
  - Speedup is limited when the processors have different workloads.
  - The processors that finish early will be idle while they are waiting for the other processors to catch up (a sketch follows below).
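A hypothetical sketch of load imbalance: in the loop below, iteration i does i units of work, so the default static schedule leaves the threads holding the small early iterations idle near the end; schedule(dynamic) hands out iterations as threads become free:

/* imbalance.c - hypothetical sketch of a load-imbalanced loop.
   Compile with: cc -fopenmp imbalance.c */
#include <stdio.h>
#include <omp.h>

/* Iteration-dependent work: iteration i costs i inner steps. */
static double work(int i)
{
    double s = 0.0;
    for (int j = 0; j < i; j++)
        s += j * 1e-9;
    return s;
}

int main(void)
{
    double total = 0.0;

    /* With schedule(static), the thread assigned the largest
       iterations finishes last while the early finishers wait at
       the implicit barrier; schedule(dynamic) reduces that idle time. */
    #pragma omp parallel for schedule(dynamic) reduction(+:total)
    for (int i = 0; i < 20000; i++)
        total += work(i);

    printf("total = %g\n", total);
    return 0;
}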
Memory Contention Limitation
• Many of these tools can be used with the PAPI performance counter interface.
• Be sure to refer to the man pages and webpages on the NCSA website for more information.
• If the output of the utility shows that memory contention is a problem, you will want to use some programming techniques for reducing memory contention.
• A good way to reduce memory contention is to access elements from the processor's cache memory instead of the main memory.
• Some programming techniques for doing this are:
  - Access arrays with unit stride.
  - Order nested do loops (in Fortran) so that the innermost loop index is the leftmost index of the arrays in the loop. For the C language, the order is the opposite of Fortran.
  - Avoid specific array sizes that are the same as the size of the data cache, or that are exact fractions or exact multiples of the size of the data cache.
  - Pad common blocks.
• These techniques are called cache tuning optimizations. The details for performing these code modifications are covered in the Cache Optimization section of this lecture; a unit-stride sketch in C follows below.
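As a small taste of that section, here is a hypothetical C sketch of the unit-stride and loop-ordering advice above:

/* stride.c - hypothetical sketch of unit-stride access in C, where
   arrays are stored row-major (the opposite of Fortran). */
#define N 1024
static double a[N][N];

/* Good: the rightmost index j varies fastest, so consecutive
   iterations touch consecutive memory locations (unit stride). */
void good_order(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i + j;
}

/* Bad: the leftmost index i varies fastest, so each access jumps
   N doubles ahead in memory, defeating the cache. (In Fortran,
   with column-major storage, this ordering would be the good one.) */
void bad_order(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = i + j;
}

int main(void)
{
    good_order();
    bad_order();
    return 0;
}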