Parallel & Distributed Computer Systems Dr. Mohammad Ansari
Jul 04, 2015
Course Details
Delivery
◦ Lectures/discussions: English
◦ Assessments: English
◦ Ask questions in class if you don’t understand
◦ Email me after class if you do not want to ask in class
◦ DO NOT LEAVE QUESTIONS TILL THE DAY BEFORE THE EXAM!!!
Assessments (this may change)
◦ Homework (~1 per week): 10%
◦ Midterm: 20%
◦ 1 project + final exam OR 2 projects: 35%+35%
Course Details
Textbook
◦ Principles of Parallel Programming, Lin & Snyder
Other sources of information:
◦ COMP 322, Rice University
◦ CS 194, UC Berkeley
◦ Cilk lectures, MIT
Many sources of information on the
internet for writing parallelized code
Teaching Materials & Assignments
Everything is on Jusur
◦ Lectures
◦ Homework
Submit homework through Jusur
Homework is given out on Saturday
Homework due following Saturday
You lose 10% for each day late
Homework 1
First homework is available on Jusur
◦ Install Linux on your computer
It is needed for future homework
It is needed to access the supercomputers
◦ Check settings/hardware
Submit pictures of your settings
Submit description of your processor
◦ Deadline: 27/03/1431 (submit on Jusur)
Cheating in Homework/Projects
Cheating
◦ If you cheat, you get zero
◦ If you help others cheat, you will also get zero
◦ Copy + paste from Internet, e.g. Wikipedia, or
elsewhere, is also cheating (called plagiarism)
◦ You can read any source of information, but you
must write answers in your own words
◦ If you have problems, please ask for help.
Outline
Previous lecture:
◦ Why study parallel computing?
◦ Topics covered on this course
This lecture:
◦ Example problem
Next week:
◦ Parallel processor architectures
Example Problem
We will parallelize a simple problem
Begin to explore some of the issues
related to parallel programming, and
performance of parallel programs
Example Problem: Array Sum
Add all the numbers in a large array
It has 100 million elements
int size = 100000000;
int array[] = {7,3,15,10,13,18,6,4,…};
What code should we write for a
sequential program?
Example Problem: Sequential
int sum = 0;
int i = 0;
for(i = 0; i < size; i++) {
sum += array[i]; //sum=sum+array[i];
}
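A minimal sketch of the sequential version as a compilable C function. The function name array_sum and the 64-bit accumulator are assumptions that are not on the slides: with 100 million elements the total may not fit in a 32-bit int, so a wider type is the safer choice.

```c
#include <assert.h>
#include <stddef.h>

/* Sum all elements of an array sequentially.
 * A 64-bit accumulator is used because the total of 100 million
 * ints may not fit in a 32-bit int. */
long long array_sum(const int *array, size_t size)
{
    assert(array != NULL || size == 0);
    long long sum = 0;
    for (size_t i = 0; i < size; i++) {
        sum += array[i];   /* sum = sum + array[i] */
    }
    return sum;
}
```

Calling array_sum on the first eight values shown on the slide (7, 3, 15, 10, 13, 18, 6, 4) returns 76.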
How Do We Parallelize?
Objective: Thinking about parallelism
◦ Multiple processors need something to do
A program/software has to be split into parts
Each part can be executed on a different processor.
◦ How do we improve performance over single processor?
If a problem takes 2 seconds on a single processor
And we break it into two (equal) parts: 1 second for each part
And we execute the two parts separately, but in parallel, on two processors, then we improve performance
How Do We Parallelize?
[Figure: timeline. Sequential: Part 0 then Part 1 on CPU 0, finishing at time 2. Parallel: Part 0 on CPU 0 and Part 1 on CPU 1 at the same time, finishing at time 1.]
How Do We Start Parallelizing?
What parts can be done separately?
◦ What parts can we do on separate processors?
◦ Meaning: What parts have no data dependence
◦ Data dependence:
The execution of an instruction (or line of
code) depends on the result of a previous
instruction (or line of code).
◦ Data independence:
The execution of an instruction (or line of
code) does not depend on the result of a
previous instruction (or line of code).
Example of Data Dependence
int x = 0;
int y = 5;
x = 3;
y = y + x; //Is this line dependent on the previous line?
Data Dependence & Parallelism
In a sequential program, data dependence does not matter: each instruction executes in sequence.
◦ Instructions execute one by one
In a parallel program, data independence allows parallel execution of instructions. Data dependence prevents parallel execution of instructions.
◦ Reduces parallel performance
◦ Reduces number of processors that can be used
Why is Data Dependence Bad For Parallel Programs?
Does not allow correct parallel execution
          CPU0      CPU1
Code:     x = 3;    y = y + x;
Result:   x = 3;    y = 5; //(5 + 0)
Why is Data Dependence Bad For Parallel Programs?
Does not allow correct parallel execution
          CPU0      CPU1
Code:     x = 3;    WAIT
                    y = y + x;
Result:   x = 3;    y = 8; //(5 + 3)
Why is Data Dependence Bad For Parallel Programs?
Does not allow correct parallel execution
          CPU0
Code:     x = 3;
          y = y + x;
Result:   x = 3; y = 8;
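The effect of the dependence can be reproduced in a small sequential sketch that runs the two statements in both possible orders. The function and struct names are hypothetical; the reordered version only imitates one interleaving that two CPUs could produce, it does not start real threads.

```c
#include <assert.h>

/* Values of x and y after running "x = 3;" and "y = y + x;". */
struct result { int x; int y; };

/* Program order: x = 3 runs first, so y = y + x sees the new x. */
struct result run_in_order(void)
{
    int x = 0;
    int y = 5;
    x = 3;
    y = y + x;      /* reads x == 3 */
    return (struct result){ x, y };
}

/* One possible parallel interleaving: y = y + x runs before x = 3. */
struct result run_reordered(void)
{
    int x = 0;
    int y = 5;
    y = y + x;      /* reads the old x == 0 */
    x = 3;
    return (struct result){ x, y };
}

/* The two orders disagree on y, which is why the second statement
 * cannot safely run in parallel with the first. */
void check_orders(void)
{
    assert(run_in_order().y == 8);
    assert(run_reordered().y == 5);
}
```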
Example of Data Independence
int x = 0;
int y = 5;
x = 3;
y = y + 5; //Is this line dependent on the previous line?
Why is Data Independence Useful?
Allows correct parallel execution
          CPU0      CPU1
Code:     x = 3;    y = y + 5;
Result:   x = 3;    y = 10;
Back to Array Sum Example
Does the code have data dependence?
int sum = 0;
for(int i = 0; i < size; i++) {
sum += array[i]; //sum=sum+array[i];
}
Not so easy to see
Back to Array Sum Example
Let’s unroll the loop:
int sum = 0;
sum += array[0]; //sum=sum+array[0];
sum += array[1]; //sum=sum+array[1];
sum += array[2]; //sum=sum+array[2];
sum += array[3]; //sum=sum+array[3];
…
Now we can see dependence! Each line needs the value of sum written by the line before it.
Removing Dependencies
Sometimes this is possible.
◦ Dependencies discussed in detail later.
Tip: Can be useful to look at the
problem being solved by the
code, and not the code itself.
Break Sum into Pieces
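[Figure: the array 7 3 1 0 2 9 5 8 3 6 is split into two pieces. P0 adds one piece to get S0, P1 adds the other piece to get S1, and SUM = S0 + S1.]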
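The idea of breaking the sum into pieces can be sketched as follows, using the ten values from the figure. The function names and the split point are assumptions made for illustration; the slides do not say where the array is cut. The sketch is still sequential: it only shows that S0 and S1 can be computed independently and then combined.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the elements array[begin] .. array[end - 1]. */
int range_sum(const int *array, size_t begin, size_t end)
{
    assert(begin <= end);
    int sum = 0;
    for (size_t i = begin; i < end; i++) {
        sum += array[i];
    }
    return sum;
}

/* Split the sum at index `split` (a hypothetical split point).
 * S0 and S1 do not depend on each other, so two processors
 * could compute them at the same time. */
int sum_in_two_pieces(const int *array, size_t size, size_t split)
{
    assert(split <= size);
    int s0 = range_sum(array, 0, split);     /* work for P0 */
    int s1 = range_sum(array, split, size);  /* work for P1 */
    return s0 + s1;                          /* SUM = S0 + S1 */
}
```

For the array in the figure, splitting in the middle gives S0 = 13 and S1 = 31, and any split point gives SUM = 44.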
Some Details…
A program executes inside a process
If we want to use multiple processors
◦ We need multiple processes
◦ One process for each processor (not fixed rule)
Processes are big, heavyweight
Threads are lighter than processes
◦ But same strategy
◦ One thread for each processor (not fixed rule)
We will talk about threads and processes later, if necessary
What Does the Code Look Like?
int numThreads = 2; //Assume one thread per core, and 2 cores
int sum = 0;
int middleSum[numThreads]; //One partial sum per thread
int threadSetSize = size/numThreads; //Assumes size is divisible by numThreads
//Each thread will execute this code with a different threadID
middleSum[threadID] = 0; //Partial sum must start at zero
for(int i = threadID*threadSetSize; i < (threadID+1)*threadSetSize; i++)
{
    middleSum[threadID] += array[i];
}
//Only thread 0 will execute this code
if (threadID==0) {
    for(int i = 0; i < numThreads; i++) {
        sum += middleSum[i];
    }
}
Load Balancing
Which processor is doing more work?
[Figure: the array 7 3 1 0 2 9 5 8 3 6 split into two pieces of different sizes. P0 computes S0, P1 computes S1, and SUM = S0 + S1.]
Load Balancing
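[Figure: timeline. Sequential: Part 0 then Part 1 on P0, finishing at time 2.0. Parallel with unequal parts: Part 0 on P0 and Part 1 on P1, finishing at time 1.3 instead of 1.0.]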
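A short worked calculation of what the imbalance costs. It assumes the total work is 2.0 time units and that the unbalanced split is 0.7 and 1.3 units; the split is inferred from the 1.3 mark in the figure, not stated on the slide. The parallel run finishes only when the slower processor finishes.

```latex
\[
T_{\text{parallel}} = \max(T_{\text{P0}}, T_{\text{P1}}), \qquad
\text{speedup} = \frac{T_{\text{sequential}}}{T_{\text{parallel}}}
\]
\[
\text{balanced: } \frac{2.0}{\max(1.0,\,1.0)} = 2.0, \qquad
\text{unbalanced: } \frac{2.0}{\max(0.7,\,1.3)} = \frac{2.0}{1.3} \approx 1.54
\]
```

Under these assumptions one processor sits idle for 0.6 time units, and the speedup drops from 2.0 to about 1.54.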
Example Problem: Array Sum
Parallelized code is more complex
Requires us to think differently about
how to solve the problem
◦ Need to think about breaking it into parts
◦ Analyze data dependencies, remove if possible
◦ Need to load balance for better performance
Example Problem: Array Sum
However, the parallel code is broken
◦ Thread 0 adds all the middle sums.
◦ What if thread 0 finishes its own work, but
other threads have not?
Synchronization
P0 will probably finish before P1
[Figure: the array 7 3 1 0 2 9 5 8 3 6 split into two pieces of different sizes. P0 computes S0, P1 computes S1, and SUM = S0 + S1.]
How Can We Fix The Code to GUARANTEE It Works Correctly?
int numThreads = 2; //Assume one thread per core, and 2 cores
int sum = 0;
int middleSum[numThreads]; //One partial sum per thread
int threadSetSize = size/numThreads; //Assumes size is divisible by numThreads
//Each thread will execute this code with a different threadID
middleSum[threadID] = 0; //Partial sum must start at zero
for(int i = threadID*threadSetSize; i < (threadID+1)*threadSetSize; i++)
{
    middleSum[threadID] += array[i];
}
//Only thread 0 will execute this code
if (threadID==0) {
    for(int i = 0; i < numThreads; i++) {
        sum += middleSum[i];
    }
}
Synchronization
Sometimes we need to
coordinate/organize threads
If we don’t, the code might calculate the
wrong answer to the problem
Can happen even if load balance is perfect
Synchronization is concerned with this
coordination / organization
Code with Synchronization Fixed
int numThreads = 2; //Assume one thread per core, & 2 cores
int sum = 0;
int middleSum[numThreads]; //One partial sum per thread
int threadSetSize = size/numThreads; //Assumes size is divisible by numThreads
//Each thread will execute this code with a different threadID
middleSum[threadID] = 0; //Partial sum must start at zero
for(int i = threadID*threadSetSize; i < (threadID+1)*threadSetSize; i++)
{
    middleSum[threadID] += array[i];
}
waitForAllThreads(); //Wait until every thread has finished its partial sum
//Only thread 0 will execute this code
if (threadID==0) {
    for(int i = 0; i < numThreads; i++) {
        sum += middleSum[i];
    }
}
Synchronization
The example shows a barrier
This is one type of synchronization
Barriers require all threads to reach
that point in the code, before any
thread is allowed to continue
It is like a gate. All threads come to
the gate, and then it opens.
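A hedged sketch of the same program with real threads, assuming POSIX threads on Linux. The slides do not name a threading library, so pthread_barrier_wait stands in here for the slide's waitForAllThreads(); the function name parallel_sum and the global variable names are hypothetical. As on the slides, it assumes the array size is divisible by the number of threads.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_THREADS 2   /* assume one thread per core, and 2 cores */

static const int *g_array;                   /* shared input array */
static size_t g_size;                        /* number of elements */
static long long g_middle_sum[NUM_THREADS];  /* one partial sum per thread */
static long long g_sum;                      /* final result, written by thread 0 */
static pthread_barrier_t g_barrier;

static void *worker(void *arg)
{
    size_t thread_id = (size_t)(uintptr_t)arg;
    size_t set_size = g_size / NUM_THREADS;  /* assumes size divides evenly */

    g_middle_sum[thread_id] = 0;
    for (size_t i = thread_id * set_size; i < (thread_id + 1) * set_size; i++) {
        g_middle_sum[thread_id] += g_array[i];
    }

    /* Barrier: no thread continues until all threads have arrived. */
    pthread_barrier_wait(&g_barrier);

    if (thread_id == 0) {   /* only thread 0 adds the middle sums */
        g_sum = 0;
        for (size_t i = 0; i < NUM_THREADS; i++) {
            g_sum += g_middle_sum[i];
        }
    }
    return NULL;
}

long long parallel_sum(const int *array, size_t size)
{
    pthread_t threads[NUM_THREADS];
    g_array = array;
    g_size = size;

    int rc = pthread_barrier_init(&g_barrier, NULL, NUM_THREADS);
    assert(rc == 0);
    for (size_t t = 0; t < NUM_THREADS; t++) {
        rc = pthread_create(&threads[t], NULL, worker, (void *)(uintptr_t)t);
        assert(rc == 0);
    }
    for (size_t t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);   /* wait for every thread to exit */
    }
    pthread_barrier_destroy(&g_barrier);
    return g_sum;
}
```

On older toolchains this needs to be compiled with the -pthread option. Without the barrier, thread 0 could read g_middle_sum[1] before thread 1 has finished writing it.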
Generalizing the Solution
We only looked at how to parallelize
for 2 threads
But the code is more general
◦ Can use any number of threads
◦ Important that code is written this way
◦ We will look at this in more detail later
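One detail the slide code leaves open is what happens when the array size is not divisible by the number of threads: size/numThreads rounds down, so the last few elements would be skipped. The sketch below is one possible way to compute each thread's range for any thread count; the names chunk and chunk_for_thread are hypothetical and not from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open index range [begin, end) handled by one thread. */
struct chunk { size_t begin; size_t end; };

/* Divide `size` elements among `num_threads` threads.
 * The first (size % num_threads) threads get one extra element,
 * so chunk sizes differ by at most one and no element is skipped. */
struct chunk chunk_for_thread(size_t size, size_t num_threads, size_t thread_id)
{
    assert(num_threads > 0 && thread_id < num_threads);
    size_t base = size / num_threads;
    size_t extra = size % num_threads;
    size_t begin = thread_id * base + (thread_id < extra ? thread_id : extra);
    size_t len = base + (thread_id < extra ? 1 : 0);
    return (struct chunk){ begin, begin + len };
}
```

With 10 elements and 3 threads this gives the ranges [0, 4), [4, 7) and [7, 10), which also keeps the load close to balanced.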
Parallel Program Performance
Now the program is correct. Let’s look at performance.
[Figure: bar chart "Time on 2-core Processor" showing run time (axis from 0 to 1) for 1 thread, 2 threads, and 4 threads.]
Performance
Two threads are not 2x faster than one. Why?
◦ The problem is called false sharing
◦ To understand this, we have to look at the
computer architecture
◦ We will study this in the next lecture
Four threads are slower than two threads.
Why?
◦ The processor only has two cores
◦ Four threads add scheduling overhead, which
wastes time
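As a preview only, here is a sketch of one common way to reduce false sharing in this kind of loop. It is an assumption about how the code might be changed, not something the slides show: each thread adds into a private local variable and writes to its shared middle-sum slot once, instead of on every iteration. The function name partial_sum_local is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum array[begin] .. array[end - 1] into a local variable and store
 * the result into the shared slot only once at the end. */
void partial_sum_local(const int *array, size_t begin, size_t end,
                       long long *middle_sum_slot)
{
    assert(begin <= end);
    long long local = 0;            /* private to the calling thread */
    for (size_t i = begin; i < end; i++) {
        local += array[i];
    }
    *middle_sum_slot = local;       /* single write to shared memory */
}
```

Each thread would call this with its own range and its own slot, for example &middleSum[threadID]. Whether it closes the gap to a 2x speedup depends on the hardware and the compiler.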
Summary
Used an example to start looking at
how to parallelize code, and some of
the main issues
◦ Data dependence
◦ Load balancing
◦ Synchronization
Each will be discussed in more detail
in later lectures