High Performance Computing: Concepts, Methods & Means
Parallel Algorithms 2
Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
March 6, 2007
Topics
• Introduction
• Midterm Exam Review
• Matrix Multiplication
• N-Body Problem
• Fast Fourier Transform (FFT)
• Summary – Materials for Test
Topics
• Introduction
• Midterm Exam Review
• Matrix Multiplication
• N-Body Problem
• Fast Fourier Transform (FFT)
• Summary – Materials for Test
Half-Way Through (almost)
• More of the same
• Today: some basic algorithms (in MPI)
– Matrix-matrix multiply
– N-body
– FFT
• But first: a brief walk-through in preparation for the Midterm exam! (Good Luck)
Topics
• Introduction
• Midterm Exam Review
• Matrix Multiplication
• N-Body Problem
• Fast Fourier Transform (FFT)
• Summary – Materials for Test
How to Prepare for Midterm
• Closed book exam
• Will look like a problem set (use as a template)
• Study aids:
– Summary slide at the end of each lecture
– Problem sets
• Will emphasize
– Basic knowledge
– Skills
– Performance models
• Note:
– To be held in room 338 Johnston Hall
– 1 hour 15 minutes
– Bring a calculator (know how to use it)
HPC in Overview (1st half)
• Supercomputing evolves as an interplay of
– Device technology
– Computer architecture
– Execution models and programming methods
• Three classes of parallel computing
– Capacity
– Cooperative
– Capability
• Three execution models
– Throughput
– Shared memory multithreaded
– Communicating sequential processes (message passing)
• Three programming formalisms
– Condor
– OpenMP
– MPI
• Performance modeling and measurement
– Metrics
– Models
– Measurement tools
S1 – L3 - Benchmarking
• Basic performance metrics (slide 4)
• Definition of benchmark in own words; purpose of benchmarking; properties of a good benchmark (slides 5, 6, 7)
• Linpack: what it is, what it measures, concepts and complexities (slides 15, 17, 18)
• HPL: algorithms and concepts (slides 21 through 24)
• Linpack compare and contrast (slide 25)
• General knowledge about HPCC and NPB suites (slides 31, 34, 35)
• Benchmark result interpretation (slides 49, 50)
S1 – L4 : Capacity Computing
• Understand material on slides (4,5), (7,8)
• Understand the example detailed in slides 17, 18
• Understand (19) and be able to derive (20,21), (22,23)
• Understand Condor concepts detailed in slides 30, 31, 32
• Condor commands (37-47): know what the basic commands are, what they do, and how to interpret the output presented by them (no need to memorize command-line options)
• Understand issues listed on slide 53
• Required reading materials:
– http://www.cct.lsu.edu/~cdekate/7600/beowulf-chapter-rev1.pdf
– Specific pages to focus on: 3-16
S2 – L1 : Architecture
• Need to know content on slides 11, 15, 22, 23, 33
• Understand how each of the technologies listed on slide 7 affects performance
• Understand concepts on slides 8, 9
• Understand concepts on slides 17, 18, 20
• Understand pipelining concepts and equations detailed in slides 27, 28
• Understand vector processing concepts and equations detailed in slides 29, 30
S2 – L2 : SMP
• Please make sure that you have addressed all points outlined on slide 5
• Understand content on slide 7
• Understand concepts, equations, problems on slides 11, 12, 13
• Understand content on slides 21, 24, 26, 29
• Understand concepts on slides 32, 33
• Understand content on slides 36, 55
• Required reading material :
http://arstechnica.com/articles/paedia/hardware/pcie.ars/1
S2 – L3 : PThreads
• Performance & cpi: slide 8
• Multithread concepts: 13, 16, 18, 19, 22, 24, 31
• Thread implementations: 35 – 37
• Pthreads: 43 – 45, 48, 55
S2 – L4 : OpenMP
• Components: 6
• Compiling: 9, 12
• Environment variables: 13, 14
• Top level: 15
• Shared data: 18, 19, 20
• Parallel flow control: 23, 24, 25
• Synchronization: 32, 34
• Performance: 39
• Synopsis: 44
S2 – L5 : Performance 2
• Measuring system operation slides: 11, 13, 17
• Gprof slides: 21, 22
• PerfSuite slides: 25, 29
• PAPI slides: 33 – 36 (inclusive)
• Tau slides: 56 – 60 (inclusive)
S3 – L1 : Communicating Sequential Processes
• Basics: slides 6 – 9, 16
• CSP: slide 19
• Unix: slides 24, 28 - 30
S3 – L2 : MPI
• MPI standard: slides 4, 7
• Compile and run an MPI program: slides 10, 11
• Environment functions: slides 12, 14
• Point-to-point functions: slides 27, 28
• Blocking vs. nonblocking: slides 25, 26
• Deadlock: slides 29-31
• Basic collective functions: slides 33, 34, 36, 38, 40, 41, 43
S3 – L3 : Performance 3
• Essential MPI – slide: 9
• Performance models – slides: 12, 15, 16, 18 (Hockney)
• LogP – slides: 20 – 23
• Effective bandwidth – slide: 30
• Tau/MPI – slides: 41, 43
S3 – L4 : Parallel Algorithms
• Introduction – slides: 4, 5, 6
• Array decomposition – slides: 11, 12
• Mandelbrot load balancing – slides: 25, 26
• Monte Carlo create communicators – slides: 40, 42
System level Overview
• Understand the 3 classes of parallel computing (capacity, cooperative, capability).
• Software System
– Understand the software stack (e.g. OS, compilers…) used in various supercomputers
– Conceptual understanding of different parallel programming models (e.g. shared memory, message passing…), advantages and disadvantages of each system.
• Computer Architecture
– Understand and be able to discuss different sources of performance degradation (latency, overhead, etc.)
– Understand Amdahl's Law and be able to solve problems related to the same, as well as scalability, efficiency, and cpi
– Understand and be able to describe the different forms of hardware parallelism (pipelining, ILP, multiprocessors (SIMD, MIMD), etc.)
– Understand numerical problems provided in section 1 Problem Sets (1,2,3) and the associated equations & theory behind them
Execution Models
• Throughput execution model (e.g. Condor)
– Be aware of the various Condor commands
– Thoroughly understand core Condor concepts (e.g. ClassAds and Matchmaking), and how these concepts work together
• Shared memory multithreaded (e.g. OpenMP)
– Understand sources of contention (race conditions) and how to resolve them (critical sections, etc.)
– Understand various OpenMP constructs and how they work (e.g. be able to answer questions like how and when to use the "critical" construct and its performance implications)
– Understand the concept of shared, private, and reduction variables.
– Be able to read and understand simple OpenMP code (C) and be able to make conceptual changes where and when asked.
Execution Models and Performance
• Communicating Sequential Processes
– Conceptual understanding of CSP
– Know the meaning and common usage of the various MPI constructs
– Understand fundamental concepts like deadlock and how to resolve them
– Be able to read a small code snippet and correct conceptual (NOT syntactical) problems. You do not need to memorize the syntax of MPI constructs.
• Performance & Benchmarking
– Be aware of the Top500 list and the benchmarks used
– Be aware of the different benchmarks and what each of them stresses (Linpack, HPL, different components of HPCC…)
– Be aware of the different performance tools discussed in class and what they measure.
– Understand and be able to solve problems related to the LogP model
Key Terms and Concepts
• Speedup : Relative reduction of execution time of a fixed size workload through parallel execution
• Efficiency : Ratio of the actual performance to the best possible performance.
Speedup = (execution time on one processor) / (execution time on N processors)

Efficiency = (execution time on one processor) / ((execution time on multiple processors) × (number of processors))
Ideal Speedup Example
W = 2^20 units of total work, divided into tasks w_1 … w_1024 of 2^10 operations each, so W = Σ_i w_i
P = 2^8 = 256 processors
T(1) = 2^20 steps
T(2^8) = 2^12 steps

Speedup = T(1) / T(2^8) = 2^20 / 2^12 = 2^8 = 256
Efficiency = T(1) / (2^8 × T(2^8)) = 2^20 / (2^8 × 2^12) = 1

Units: steps
Ideal Speedup Issues
• W is the total workload measured in elemental pieces of work (e.g. operations, instructions, etc.)
• T(p) is the total execution time measured in elemental time steps (e.g. clock cycles), where p is the # of execution sites (e.g. processors, threads)
• w_i is the work for a given task i
• Example: here we divide a million (really Mega, i.e. 2^20) operation workload, W, into a thousand (1024) tasks, w_1 to w_1024, each of 1 K operations
• Assume 256 processors performing the workload in parallel
• T(256) = 4096 steps, speedup = 256, Eff = 1
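The speedup and efficiency arithmetic above can be checked with a small C snippet. This is an illustrative sketch (the function names are mine, not from the lecture):

```c
#include <assert.h>

/* Speedup: execution time on one processor over time on p processors */
double speedup(double t1, double tp) { return t1 / tp; }

/* Efficiency: speedup normalized by the processor count */
double efficiency(double t1, double tp, double p) { return t1 / (p * tp); }

/* Example from the slide: T(1) = 2^20 steps, T(256) = 2^12 steps */
void check_ideal_example(void) {
    double t1 = 1048576.0;   /* 2^20 */
    double tp = 4096.0;      /* 2^12 */
    assert(speedup(t1, tp) == 256.0);          /* 2^8 */
    assert(efficiency(t1, tp, 256.0) == 1.0);  /* ideal case */
}
```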
Amdahl’s Law
S = T_O / T_F
f = T_A / T_O
T_F = (1 - f) T_O + (f T_O) / g
S = 1 / ((1 - f) + f / g)

where
T_O = time for non-accelerated computation
T_F = time for accelerated computation
T_A = time of the portion of computation that can be accelerated
g = peak performance gain for the accelerated portion of computation
f = fraction of the non-accelerated computation to be accelerated
S = speed up of the computation with acceleration applied

(Figure: two time lines from start to end — the non-accelerated run of length T_O containing the accelerable portion T_A, and the accelerated run of length T_F in which that portion takes T_A / g.)
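Amdahl's law can be evaluated directly as a sanity check. A minimal sketch (the helper name is mine, not from the slides):

```c
#include <assert.h>
#include <math.h>

/* Amdahl's law: S = 1 / ((1 - f) + f/g), where f is the fraction of the
   computation that is accelerated and g is the peak gain on that fraction. */
double amdahl_speedup(double f, double g) {
    return 1.0 / ((1.0 - f) + f / g);
}
```

With f = 0 nothing is accelerated and S = 1; with f = 1 the whole computation is accelerated and S = g; even as g grows without bound, S is capped at 1/(1 - f).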
Overhead
(Figure: the workload divided among P = 4 workers; each task consists of overhead v plus work w, so W = 4v + 4w.)

W = Σ_i w_i
T_1 = W + v
T_P = W/P + v
S = T_1 / T_P = (W + v) / (W/P + v) ≈ P / (1 + P·v/W)

v = overhead
w = work unit
W = total work
T_i = execution time with i processors
P = # processors

Assumption: workload is infinitely divisible
Scalability & Overhead
J = # tasks = W / w_g
T_1 = W + v ≈ W (when W >> v)
T_P = J (w_g + v) / P
S = T_1 / T_P ≈ (W P) / (J (w_g + v)) = P / (1 + v / w_g)

v = overhead
w_g = work unit (task size)
W = total work
T_i = execution time with i processors
P = # processors
J = # tasks
Scalability and Overhead for fixed-size work tasks

• W is divided into J tasks of size w_g
• Each task requires v overhead work to manage
• For P processors there are approximately J/P tasks to be performed in sequence, so:
• T_P = J (w_g + v) / P
• Note that S = T_1 / T_P
• So, S = P / (1 + v / w_g)
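The fixed-size-task result reduces to one line of arithmetic. A sketch (the function name is mine):

```c
#include <assert.h>
#include <math.h>

/* Speedup with per-task overhead v and task size wg: S = P / (1 + v/wg).
   Overhead that is small relative to the task size costs little speedup. */
double overhead_speedup(double P, double v, double wg) {
    return P / (1.0 + v / wg);
}
```

For example, overhead_speedup(256, 0, 1024) gives the ideal 256, while with v == wg (overhead as large as the work itself) the speedup halves.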
Measuring LogP Parameters
• Finding L+2*o
– Proc 0: (MPI_Send() then MPI_Recv()) x N
– Proc 1: (MPI_Recv() then MPI_Send()) x N
– L+2*o = total time/N
Figure 1: Time diagram for benchmark 1. (a) time diagram of processor 0; (b) time diagram of processor 1.
Measuring LogP Parameters
• Finding o
– Proc 0: (MPI_Send() then some_work then MPI_Recv() ) x N
– Proc 1: (MPI_Recv() then MPI_Send() then some_work) x N
– o = (1/2)total time/N – time(some_work)
– requires time(some_work) > 2*L+2*o
Figure 2: Time diagram for benchmark 2 with X > 2*L + Or + Os. (a) time diagram of processor 1; (b) time diagram of processor 2.
Performance Metrics
• Peak floating point operations per second (flops)
• Peak instructions per second (ips)
• Sustained throughput
– flops, Mflops, Gflops, Tflops, Pflops
– flops, Megaflops, Gigaflops, Teraflops, Petaflops
– ips, Mips, …
• Cycles per instruction
– cpi
– Alternatively: instructions per cycle, ipc
• Memory access latency
– cycles or seconds
• Memory access bandwidth
– bytes per second
– or Gigabytes per second, GBps, GB/s
• Bi-section bandwidth
– bytes per second
CPI (continued)
T = #I × cpi × t_cycle
cpi = r_R × cpi_R + r_M × cpi_M
cpi_M = m_hit × cpi_M-hit + m_miss × cpi_M-miss

where
r_R = #I_R / #I, r_M = #I_M / #I, r_R + r_M = 1.0
m_hit = #I_M-hit / #I_M, m_miss = #I_M-miss / #I_M

so
T = #I × (r_R × cpi_R + r_M × (m_hit × cpi_M-hit + m_miss × cpi_M-miss)) × t_cycle
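The CPI decomposition can be evaluated numerically. The snippet below is an illustrative sketch of that model (parameter names are my own, not the lecture's notation):

```c
#include <assert.h>
#include <math.h>

/* Effective cpi: register-only instructions (rate rR, cost cpiR) plus
   memory instructions (rate rM) split into hits and misses.
   Assumes rR + rM = 1 and m_hit + m_miss = 1. */
double effective_cpi(double rR, double cpiR, double rM,
                     double m_miss, double cpi_hit, double cpi_miss) {
    double m_hit = 1.0 - m_miss;
    double cpiM = m_hit * cpi_hit + m_miss * cpi_miss;
    return rR * cpiR + rM * cpiM;
}

/* Total execution time: T = #I * cpi * t_cycle */
double exec_time(double icount, double cpi, double t_cycle) {
    return icount * cpi * t_cycle;
}
```

With 30% memory instructions, a 10% miss rate, a 2-cycle hit, and a 102-cycle miss, the memory cpi is 0.9·2 + 0.1·102 = 12, so the effective cpi is 0.7·1 + 0.3·12 = 4.3.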
Basic Performance Metrics
• Time related:
– Execution time [seconds]
• wall clock time
• system and user time
– Latency
– Response time
• Rate related:
– Rate of computation
• floating point operations per second [flops]
• integer operations per second [ops]
– Data transfer (I/O) rate [bytes/second]
• Effectiveness:
– Efficiency [%]
– Memory consumption [bytes]
– Productivity [utility/($*second)]
• Modifiers:
– Sustained
– Peak
– Theoretical peak
Basic Parallel (MPI) Program Steps
• Establish logical bindings
• Initialize application execution environment
• Distribute data and work
• Perform core computations in parallel (across nodes)
• Synchronize and exchange intermediate data results
– Optional for non-embarrassingly parallel (cooperative)
• Detect “stop” condition
– Maybe implicit with a barrier etc.
• Aggregate final results
– Often a reduction operator
• Output results and error code
• Terminate and return to OS
The Essential MPI
• API elements:
– MPI_Init(), MPI_Finalize()
– MPI_Comm_size(), MPI_Comm_rank()
– MPI_COMM_WORLD
– Error checking using MPI_SUCCESS
– MPI basic data types (slide 27)
– Blocking: MPI_Send(), MPI_Recv()
– Non-blocking: MPI_Isend(), MPI_Irecv(), MPI_Wait()
– Collective calls: MPI_Barrier(), MPI_Bcast(), MPI_Gather(), MPI_Scatter(), MPI_Reduce()
• Commands:
– Running MPI programs: mpirun
– Compile: mpicc
– Compile: mpif77
Topics
• Introduction
• Midterm Exam Review
• Matrix Multiplication
• N-Body Problem
• Fast Fourier Transform (FFT)
• Summary – Materials for Test
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.
Matrices — A Review
An n x m matrix
Matrix Multiplication
Multiplication of two matrices, A and B, produces the matrix C whose elements, c_i,j (0 <= i < n, 0 <= j < m), are computed as follows:

c_i,j = Σ_{k=0}^{l-1} a_i,k × b_k,j

where A is an n x l matrix and B is an l x m matrix.
Matrix multiplication, C = A x B
Implementing Matrix Multiplication
Sequential Code
Assume throughout that the matrices are square (n x n matrices).The sequential code to compute A x B could simply be
for (i = 0; i < n; i++)
   for (j = 0; j < n; j++) {
      c[i][j] = 0;
      for (k = 0; k < n; k++)
         c[i][j] = c[i][j] + a[i][k] * b[k][j];
   }

This algorithm requires n^3 multiplications and n^3 additions, leading to a sequential time complexity of O(n^3). Very easy to parallelize.
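The triple loop above can be written as a standalone C function on flattened row-major arrays. This is an illustrative sketch, not the lecture's MPI code:

```c
#include <assert.h>

/* Naive n x n matrix multiply on row-major flattened arrays:
   n^3 multiplications and n^3 additions, i.e. O(n^3) sequential time. */
void matmul(int n, const double *a, const double *b, double *c) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] = sum;
        }
}
```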
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.
41
Block Matrix Multiplication
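The original slide's figure is not reproduced here. As a sketch of the idea (assuming a block size bs that divides n; the names are mine), the same product can be computed block by block, so that each bs x bs tile of A and B is reused while it is in cache:

```c
/* Blocked (tiled) n x n matrix multiply on row-major flattened arrays.
   Each bs x bs block of C accumulates products of matching blocks of A and B. */
void matmul_blocked(int n, int bs, const double *a, const double *b, double *c) {
    for (int i = 0; i < n * n; i++) c[i] = 0.0;
    for (int ib = 0; ib < n; ib += bs)
        for (int kb = 0; kb < n; kb += bs)
            for (int jb = 0; jb < n; jb += bs)
                /* multiply block (ib,kb) of A by block (kb,jb) of B */
                for (int i = ib; i < ib + bs; i++)
                    for (int k = kb; k < kb + bs; k++)
                        for (int j = jb; j < jb + bs; j++)
                            c[i*n + j] += a[i*n + k] * b[k*n + j];
}
```

The result is identical to the naive algorithm; only the loop order and memory traffic change.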
Performance Improvement
Using tree construction, n numbers can be added in log n steps using n processors:

Computational time complexity of O(log n) using n^3 processors.
Flowchart for Matrix Multiplication

“master”:
1. Initialize MPI environment
2. Initialize array
3. Partition array into workloads
4. Send workload to “workers”
5. Wait for “workers” to finish task
6. Recv. results
7. Print results
8. End

“workers” (each):
1. Initialize MPI environment
2. Recv. work
3. Calculate matrix product
4. Send result
5. End
Matrix Multiplication (source code)

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define NRA 62          /* number of rows in matrix A */
#define NCA 15          /* number of columns in matrix A */
#define NCB 7           /* number of columns in matrix B */
#define MASTER 0        /* taskid of first task */
#define FROM_MASTER 1   /* setting a message type */
#define FROM_WORKER 2   /* setting a message type */

int main(int argc, char *argv[])
{
   int numtasks,             /* number of tasks in partition */
       taskid,               /* a task identifier */
       numworkers,           /* number of worker tasks */
       source,               /* task id of message source */
       dest,                 /* task id of message destination */
       mtype,                /* message type */
       rows,                 /* rows of matrix A sent to each worker */
       averow, extra, offset,/* used to determine rows sent to each worker */
       i, j, k, rc;          /* misc */
   double a[NRA][NCA],       /* matrix A to be multiplied */
          b[NCA][NCB],       /* matrix B to be multiplied */
          c[NRA][NCB];       /* result matrix C */
   MPI_Status status;
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
46
Matrix Multiplication (source code)

   /* Initialize MPI Environment */
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
   if (numtasks < 2) {
      printf("Need at least two MPI tasks. Quitting...\n");
      MPI_Abort(MPI_COMM_WORLD, rc);
      exit(1);
   }
   numworkers = numtasks - 1;

   /* Master block */
   if (taskid == MASTER) {
      printf("mpi_mm has started with %d tasks.\n", numtasks);
      printf("Initializing arrays...\n");
      for (i = 0; i < NRA; i++)
         for (j = 0; j < NCA; j++)
            a[i][j] = i + j;      /* Initialize array a */
      for (i = 0; i < NCA; i++)
         for (j = 0; j < NCB; j++)
            b[i][j] = i * j;      /* Initialize array b */
      /* Send matrix data to the worker tasks */
      averow = NRA / numworkers; /* fraction of the array processed by each “worker” */
      extra = NRA % numworkers;
      offset = 0;
      mtype = FROM_MASTER;       /* Message tag */
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
Matrix Multiplication (source code)

      for (dest = 1; dest <= numworkers; dest++) {
         /* To each worker send: start point, number of rows to process,
            and sub-arrays to process */
         rows = (dest <= extra) ? averow + 1 : averow;
         printf("Sending %d rows to task %d offset=%d\n", rows, dest, offset);
         MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
         MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
         MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
         MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
         offset = offset + rows;
      }

      /* Receive results from worker tasks */
      mtype = FROM_WORKER;  /* Message tag for messages sent by “workers” */
      for (i = 1; i <= numworkers; i++) {
         source = i;
         /* offset stores the (processing) starting point of the work chunk */
         MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
         MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
         /* The array C contains the product of sub-array A and the array B */
         MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
         printf("Received results from task %d\n", source);
      }
      printf("******************************************************\n");
      printf("Result Matrix:\n");
      for (i = 0; i < NRA; i++) {
         printf("\n");
         for (j = 0; j < NCB; j++)
            printf("%6.2f ", c[i][j]);
      }
      printf("\n******************************************************\n");
      printf("Done.\n");
   }
Matrix Multiplication (source code)
   /**************************** worker task ************************************/
   if (taskid > MASTER) {
      mtype = FROM_MASTER;
      MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

      for (k = 0; k < NCB; k++)
         for (i = 0; i < rows; i++) {
            c[i][k] = 0.0;
            for (j = 0; j < NCA; j++)
               /* Calculate the product and store the result in C */
               c[i][k] = c[i][k] + a[i][j] * b[j][k];
         }
      mtype = FROM_WORKER;
      MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
      MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
      /* Worker sends the resultant array to the master */
      MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
   }
   MPI_Finalize();
}
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
Demo : Matrix Multiplication

[cdekate@compute-0-6 matrix_multiplication]$ mpirun -np 4 ./mpi_mm
mpi_mm has started with 4 tasks.
Initializing arrays...
Sending 21 rows to task 1 offset=0
Sending 21 rows to task 2 offset=21
Sending 20 rows to task 3 offset=42
Received results from task 1
Received results from task 2
Received results from task 3
******************************************************
Result Matrix:
0.00 1015.00 2030.00 3045.00 4060.00 5075.00 6090.00 0.00 1120.00 2240.00 3360.00 4480.00 5600.00 6720.00 0.00 1225.00 2450.00 3675.00 4900.00 6125.00 7350.00 0.00 1330.00 2660.00 3990.00 5320.00 6650.00 7980.00 0.00 1435.00 2870.00 4305.00 5740.00 7175.00 8610.00 0.00 1540.00 3080.00 4620.00 6160.00 7700.00 9240.00 0.00 1645.00 3290.00 4935.00 6580.00 8225.00 9870.00 ……… 0.00 6475.00 12950.00 19425.00 25900.00 32375.00 38850.00 0.00 6580.00 13160.00 19740.00 26320.00 32900.00 39480.00 0.00 6685.00 13370.00 20055.00 26740.00 33425.00 40110.00 0.00 6790.00 13580.00 20370.00 27160.00 33950.00 40740.00 0.00 6895.00 13790.00 20685.00 27580.00 34475.00 41370.00 0.00 7000.00 14000.00 21000.00 28000.00 35000.00 42000.00 0.00 7105.00 14210.00 21315.00 28420.00 35525.00 42630.00 0.00 7210.00 14420.00 21630.00 28840.00 36050.00 43260.00 0.00 7315.00 14630.00 21945.00 29260.00 36575.00 43890.00 0.00 7420.00 14840.00 22260.00 29680.00 37100.00 44520.00 ******************************************************Done.[cdekate@compute-0-6 matrix_multiplication]$
Topics
• Introduction
• Midterm Exam Review
• Matrix Multiplication
• N-Body Problem
• Fast Fourier Transform (FFT)
• Summary – Materials for Test
N Bodies
OU Supercomputing Center for Education & Research
N-Body Problems
An N-body problem is a problem involving N “bodies” – that is, particles (e.g., stars, atoms) – each of which applies a force to all of the others.
For example, if you have N stars, then each of the N stars exerts a force (gravity) on all of the other N–1 stars.
Likewise, if you have N atoms, then every atom exerts a force on all of the other N–1 atoms. The forces are Coulombic and van der Waals.
2-Body Problem
When N is 2, you have – surprise! – a 2-Body Problem: exactly two particles, each exerting a force that acts on the other.
The relationship between the 2 particles can be expressed as a differential equation that can be solved analytically, producing a closed-form solution.
So, given the particles’ initial positions and velocities, you can immediately calculate their positions and velocities at any later time.
N-Body Problems
For N of 3 or more, no one knows how to solve the equations to get a closed form solution.
So, numerical simulation is pretty much the only way to study groups of 3 or more bodies.
Popular applications of N-body codes include astronomy and chemistry.
Note that, for N bodies, there are on the order of N^2 forces, denoted O(N^2).
N-Body Problems
Given N bodies, each body exerts a force on all of the other N–1 bodies.
Therefore, there are N • (N–1) forces in total.
You can also think of this as (N • (N–1))/2 forces, in the sense that the force from particle A to particle B is the same (except in the opposite direction) as the force from particle B to particle A.
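The interaction count above is simple enough to compute directly. An illustrative helper (not from the slides):

```c
#include <assert.h>

/* Number of unique pair interactions among n bodies: n*(n-1)/2 */
unsigned long long nbody_pairs(unsigned long long n) {
    return n * (n - 1) / 2;
}
```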
N-Body Problems
Given N bodies, each body exerts a force on all of the other N–1 bodies.
Therefore, there are N • (N–1) forces in total.
In Big-O notation, that’s O(N^2) forces to calculate.

So, calculating the forces takes O(N^2) time to execute.

But, there are only N particles, each taking up the same amount of memory, so we say that N-body codes are of:

• O(N) spatial complexity (memory)
• O(N^2) time complexity
O(N^2) Forces
Note that this picture shows only the forces between A and everyone else.
A
How to Calculate?
Whatever your physics is, you have some function, F(A,B), that expresses the force between two bodies A and B.
For example,
F(A,B) = G · m_A · m_B / dist(A,B)^2

where G is the gravitational constant and m is the mass of the particle in question.

If you have all of the forces for every pair of particles, then you can calculate their sum, obtaining the force on every particle.
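The gravitational pair force is one line of C. A sketch (the name is mine, and a real code would also need the force's direction, not just its magnitude):

```c
#include <assert.h>

/* Magnitude of the gravitational force between bodies A and B:
   F = G * mA * mB / dist^2 */
double grav_force(double G, double mA, double mB, double dist) {
    return G * mA * mB / (dist * dist);
}
```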
How to Parallelize?
Okay, so let’s say you have a nice serial (single-CPU) code that does an N-body calculation.
How are you going to parallelize it?
You could:
• have a master feed particles to processes;
• have a master feed interactions to processes;
• have each process decide on its own subset of the particles, and then share around the forces;
• have each process decide on its own subset of the interactions, and then share around the forces.
Do You Need a Master?
Let’s say that you have N bodies, and therefore you have ½N(N-1) interactions (every particle interacts with all of the others, but you don’t need to calculate both A→B and B→A).
Do you need a master?
Well, can each processor determine on its own either (a) which of the bodies to process, or (b) which of the interactions?
If the answer is yes, then you don’t need a master.
N-Body “Pipeline” Implementation Flowchart
1. Initialize MPI environment
2. Create ring communicator
3. Initialize particle parameters
4. Copy local particle data to send buffer
5. Initiate transmission of send buffer to the RIGHT neighbor in ring
6. Initiate reception of data from the LEFT neighbor in ring
7. Compute forces between local and send buffer particles
8. Wait for message exchange to complete
9. Copy particle data from receive buffer to send buffer
10. Processed particles from all remote nodes? If N, repeat from step 5; if Y, continue
11. Update positions of local particles
12. All iterations done? If N, repeat from step 4; if Y, finalize MPI
N-Body (source code)
#include "mpi.h"
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

/* Pipeline version of the algorithm... */
/* we really need the velocities as well… */

/* Simplified structure describing parameters of a single particle */
typedef struct {
   double x, y, z;
   double mass;
} Particle;

/* We use leapfrog for the time integration ... */

/* Structure to hold force components and old position coordinates of a particle */
typedef struct {
   double xold, yold, zold;
   double fx, fy, fz;
} ParticleV;

void InitParticles( Particle[], ParticleV [], int );
double ComputeForces( Particle [], Particle [], ParticleV [], int );
double ComputeNewPos( Particle [], ParticleV [], int, double, MPI_Comm );

#define MAX_PARTICLES 4000
#define MAX_P 128
N-Body (source code)
int main( int argc, char *argv[] )
{
   Particle particles[MAX_PARTICLES];  /* Particles on ALL nodes */
   ParticleV pv[MAX_PARTICLES];        /* Particle velocity */
   Particle sendbuf[MAX_PARTICLES],    /* Pipeline buffers */
            recvbuf[MAX_PARTICLES];
   MPI_Request request[2];
   int counts[MAX_P],                  /* Number on each processor */
       displs[MAX_P];                  /* Offsets into particles */
   int rank, size, npart, i, j,
       offset;                         /* location of local particles */
   int totpart,                        /* total number of particles */
       cnt;                            /* number of times in loop */
   MPI_Datatype particletype;
   double sim_t;                       /* Simulation time */
   double time;                        /* Computation time */
   int pipe, left, right, periodic;
   MPI_Comm commring;
   MPI_Status statuses[2];
/* Initialize MPI Environment */ MPI_Init( &argc, &argv ); MPI_Comm_rank( MPI_COMM_WORLD, &rank ); MPI_Comm_size( MPI_COMM_WORLD, &size );
/* Create 1-dimensional periodic Cartesian communicator (a ring) */ periodic = 1; MPI_Cart_create( MPI_COMM_WORLD, 1, &size, &periodic, 1, &commring ); MPI_Cart_shift( commring, 0, 1, &left, &right ); /* Find the closest neighbors in ring */
/* Calculate local fraction of particles */ if (argc < 2) {
fprintf( stderr, "Usage: %s n\n", argv[0] );MPI_Abort( MPI_COMM_WORLD, 1 );
} npart = atoi(argv[1]) / size; if (npart * size > MAX_PARTICLES) {
fprintf( stderr, "%d is too many; max is %d\n", npart*size, MAX_PARTICLES );MPI_Abort( MPI_COMM_WORLD, 1 );
} MPI_Type_contiguous( 4, MPI_DOUBLE, &particletype ); /* Data type corresponding to Particle struct */ MPI_Type_commit( &particletype );
/* Get the sizes and displacements */ MPI_Allgather( &npart, 1, MPI_INT, counts, 1, MPI_INT, commring ); displs[0] = 0; for (i=1; i<size; i++)
displs[i] = displs[i-1] + counts[i-1]; totpart = displs[size-1] + counts[size-1];
/* Generate the initial values */ InitParticles( particles, pv, npart); offset = displs[rank]; cnt = 10; time = MPI_Wtime(); sim_t = 0.0;
/* Begin simulation loop */ while (cnt--) {
double max_f, max_f_seg;
N-Body (source code)

   /* Load the initial send buffer */
   memcpy( sendbuf, particles, npart * sizeof(Particle) );
   max_f = 0.0;
   for (pipe = 0; pipe < size; pipe++) {
      if (pipe != size-1) {
         /* Initiate send to the “right” neighbor, while receiving from the “left” */
         MPI_Isend( sendbuf, npart, particletype, right, pipe, commring, &request[0] );
         MPI_Irecv( recvbuf, npart, particletype, left, pipe, commring, &request[1] );
      }
      /* Compute forces */
      max_f_seg = ComputeForces( particles, sendbuf, pv, npart );
      if (max_f_seg > max_f) max_f = max_f_seg;
      /* Wait for updates to complete and copy received particles to the send buffer */
      if (pipe != size-1) MPI_Waitall( 2, request, statuses );
      memcpy( sendbuf, recvbuf, counts[pipe] * sizeof(Particle) );
   }
   /* Compute the changes in position using the already calculated forces */
   sim_t += ComputeNewPos( particles, pv, npart, max_f, commring );
   /* We could do graphics here (move particles on the display) */
   }
   time = MPI_Wtime() - time;
   if (rank == 0) {
      printf( "Computed %d particles in %f seconds\n", totpart, time );
   }
   MPI_Finalize();
   return 0;
}
N-Body (source code)

/* Initialize particle positions, masses and forces */
void InitParticles( Particle particles[], ParticleV pv[], int npart )
{
   int i;
   for (i = 0; i < npart; i++) {
      particles[i].x = drand48();
      particles[i].y = drand48();
      particles[i].z = drand48();
      particles[i].mass = 1.0;
      pv[i].xold = particles[i].x;
      pv[i].yold = particles[i].y;
      pv[i].zold = particles[i].z;
      pv[i].fx = 0;
      pv[i].fy = 0;
      pv[i].fz = 0;
   }
}

/* Compute forces (2-D only) */
double ComputeForces( Particle myparticles[], Particle others[], ParticleV pv[], int npart )
{
   double max_f, rmin;
   int i, j;
   max_f = 0.0;
   for (i = 0; i < npart; i++) {
      double xi, yi, mi, rx, ry, mj, r, fx, fy;
      rmin = 100.0;
      xi = myparticles[i].x;
      yi = myparticles[i].y;
      fx = 0.0;
      fy = 0.0;
N-Body (source code)

      for (j = 0; j < npart; j++) {
         rx = xi - others[j].x;
         ry = yi - others[j].y;
         mj = others[j].mass;
         r = rx * rx + ry * ry;
         /* ignore overlap and same particle */
         if (r == 0.0) continue;
         if (r < rmin) rmin = r;
         /* compute forces */
         r = r * sqrt(r);
         fx -= mj * rx / r;
         fy -= mj * ry / r;
      }
      pv[i].fx += fx;
      pv[i].fy += fy;
      /* Compute a rough estimate of (1/m)|df / dx| */
      fx = sqrt(fx*fx + fy*fy)/rmin;
      if (fx > max_f) max_f = fx;
   }
   return max_f;
}

/* Update particle positions (2-D only) */
double ComputeNewPos( Particle particles[], ParticleV pv[], int npart, double max_f, MPI_Comm commring )
{
   int i;
   double a0, a1, a2;
   static double dt_old = 0.001, dt = 0.001;
   double dt_est, new_dt, dt_new;
N-Body (source code)

   /* integration is a0 * x^+ + a1 * x + a2 * x^- = f / m */
   a0 = 2.0 / (dt * (dt + dt_old));
   a2 = 2.0 / (dt_old * (dt + dt_old));
   a1 = -(a0 + a2);   /* also -2/(dt*dt_old) */
   for (i = 0; i < npart; i++) {
      double xi, yi;
      /* Very, very simple leapfrog time integration. We use a
         variable-step version to simplify time-step control. */
      xi = particles[i].x;
      yi = particles[i].y;
      particles[i].x = (pv[i].fx - a1 * xi - a2 * pv[i].xold) / a0;
      particles[i].y = (pv[i].fy - a1 * yi - a2 * pv[i].yold) / a0;
      pv[i].xold = xi;
      pv[i].yold = yi;
      pv[i].fx = 0;
      pv[i].fy = 0;
   }
   /* Recompute a time step. The stability criterion is roughly
      2/sqrt(1/m |df/dx|) >= dt. We leave a little room */
   dt_est = 1.0/sqrt(max_f);
   if (dt_est < 1.0e-6) dt_est = 1.0e-6;
   MPI_Allreduce( &dt_est, &dt_new, 1, MPI_DOUBLE, MPI_MIN, commring );
   /* Modify time step */
   if (dt_new < dt) {
      dt_old = dt;
      dt = dt_new;
   }
   else if (dt_new > 4.0 * dt) {
      dt_old = dt;
      dt *= 2.0;
   }
   return dt_old;
}
Demo : N-Body Problem
> mpirun -np 4 nbodypipe 4000
Computed 4000 particles in 1.119051 seconds
70
Topics
• Introduction
• Midterm Exam Review
• Matrix Multiplication
• N-Body Problem
• Fast Fourier Transform (FFT)
• Summary – Materials for Test
Serial FFT
• Let i = sqrt(-1) and index matrices and vectors from 0.
• The Discrete Fourier Transform of an m-element vector v is F*v, where F is the m×m matrix defined as: F[j,k] = ω^(j*k)
• where ω is: ω = e^(2πi/m) = cos(2π/m) + i*sin(2π/m)
• This is a complex number whose mth power is 1 and is therefore called an mth root of unity
• E.g., for m = 4: ω = 0+1*i, ω^2 = -1+0*i, ω^3 = 0-1*i, ω^4 = 1+0*i
Source: www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_24_1999-new.ppt
Related Transforms
• Most applications require multiplication by both F and inverse(F).
• Multiplying by F and inverse(F) are essentially the same. (inverse(F) is the complex conjugate of F divided by n.)
• For solving the Poisson equation and various other applications, we use variations on the FFT
– The sin transform -- imaginary part of F
– The cos transform -- real part of F
• Algorithms are similar, so we will focus on the forward FFT.
Serial Algorithm for the FFT

• Compute the FFT of an m-element vector v, F*v:

  (F*v)[j] = Σ_{k=0}^{m-1} F[j,k] * v(k)
           = Σ_{k=0}^{m-1} ω^(j*k) * v(k)
           = Σ_{k=0}^{m-1} (ω^j)^k * v(k)
           = V(ω^j)

• where V is defined as the polynomial V(x) = Σ_{k=0}^{m-1} x^k * v(k)
Divide and Conquer FFT
• V can be evaluated using divide-and-conquer
  V(x) = Σ_{k=0}^{m-1} x^k * v(k)
       = v[0] + x^2*v[2] + x^4*v[4] + …
         + x*(v[1] + x^2*v[3] + x^4*v[5] + … )
       = Veven(x^2) + x*Vodd(x^2)

• V has degree m-1, so Veven and Vodd are polynomials of degree m/2-1
• We evaluate these at the points (ω^j)^2 for 0 <= j <= m-1
• But these are really just m/2 different points, since

  (ω^(j+m/2))^2 = (ω^j * ω^(m/2))^2 = ω^(2j) * ω^m = (ω^j)^2
Divide-and-Conquer FFT
FFT(v, ω, m)
  if m = 1 return v[0]
  else
    veven = FFT(v[0:2:m-2], ω^2, m/2)
    vodd  = FFT(v[1:2:m-1], ω^2, m/2)
    ω-vec = [ω^0, ω^1, … ω^(m/2-1)]      (precomputed)
    return [veven + (ω-vec .* vodd),
            veven - (ω-vec .* vodd)]

• The .* above is component-wise multiplication.
• The […,…] constructs an m-element vector from two m/2-element vectors.

This results in an O(m log m) algorithm.
1D FFT: Butterfly Pattern
Higher Dimension FFTs
• FFTs on 2 or 3 dimensions are defined as 1D FFTs on vectors in all dimensions.
• E.g., a 2D FFT does 1D FFTs on all rows and then all columns
• There are 3 obvious possibilities for the 2D FFT:
– (1) 2D blocked layout for matrix, using 1D algorithms for each row and column
– (2) Block row layout for matrix, using serial 1D FFTs on rows, followed by a transpose, then more serial 1D FFTs
– (3) Block row layout for matrix, using serial 1D FFTs on rows, followed by parallel 1D FFTs on columns
– Option 1 is best
• For a 3D FFT the options are similar
– 2 phases done with serial FFTs, followed by a transpose for the 3rd dimension
– can overlap communication with the 2nd phase in practice
78
2-D FFT Flowchart

MASTER:
1. Initialize MPI environment
2. Initialize matrix
3. Distribute matrix by rows (MPI_Scatter)
4. Reorganize matrix slice into square chunks
5. Compute 1-D FFTs on rows
6. Redistribute matrix chunks (MPI_Alltoall)
7. Transpose matrix chunks
8. Compute 1-D FFTs on rows
9. Collect matrix slices (MPI_Gather)
10. Transpose assembled matrix
11. Finalize MPI

WORKER:
1. Initialize MPI environment
2. Receive matrix slice (MPI_Scatter)
3. Reorganize matrix slice into square chunks
4. Compute 1-D FFTs on rows
5. Redistribute matrix chunks (MPI_Alltoall)
6. Transpose matrix chunks
7. Compute 1-D FFTs on rows
8. Send matrix slice (MPI_Gather)
9. Finalize MPI
Fast Fourier Transform
79
MPI - Two-Dimensional Fast Fourier Transform - C Version
• The image originates on a single processor (SOURCE_PROCESSOR).
• This image, a[], is distributed by rows to all other processors.
• Each processor then performs a one-dimensional FFT on the rows of the image stored locally.
• The image is then transposed using the MPI_Alltoall() routine; this partitions the intermediate image by columns.
• Each processor then performs a one-dimensional FFT on the columns of the image.
• Finally, the columns of the image are collected back at the destination processor and the output image is tested for correctness.
• Input is a 512x512 complex matrix, initialized with a point source.
• Output is a 512x512 complex matrix that overwrites the input matrix.
• Timing and Mflop results are displayed following execution.
• A straightforward, unsophisticated 1-D FFT kernel is used. It is sufficient to convey the general idea, but be aware that better 1-D FFTs are available on many systems.
2D FFT – Code Walkthrough … 1
80
#include <stdio.h>
#include <stdlib.h>
#include <sys/utsname.h>
#include <math.h>
#include <sys/time.h>
#include <time.h>
#include <mpi.h>
#include "mpi_2dfft.h"
#define IMAGE_SIZE 512
#define NUM_CELLS 4
#define IMAGE_SLICE (IMAGE_SIZE / NUM_CELLS)
#define SOURCE_PROCESSOR 0
#define DEST_PROCESSOR SOURCE_PROCESSOR
int numtasks;                          /* Number of processors */
int taskid;                            /* ID number for each processor */
mycomplex a[IMAGE_SIZE][IMAGE_SIZE];   /* input matrix: complex numbers */
mycomplex a_slice[IMAGE_SLICE][IMAGE_SIZE];
mycomplex a_chunks[NUM_CELLS][IMAGE_SLICE][IMAGE_SLICE];
mycomplex b[IMAGE_SIZE][IMAGE_SIZE];   /* intermediate matrix */
mycomplex b_slice[IMAGE_SIZE][IMAGE_SLICE];
mycomplex b_chunks[NUM_CELLS][IMAGE_SLICE][IMAGE_SLICE];
mycomplex *collect;
mycomplex w_common[IMAGE_SIZE/2];      /* twiddle factors */
struct timeval etime[10];
int checkpoint;
float dt[10], sum;
2D FFT – Code Walkthrough … 2
81
int main(int argc, char *argv[])
{
  int rc, cell, i, j, n, nx, logn, errors, sign, flops;
  float mflops;

  checkpoint = 0;
  /* Initialize MPI environment and get this task's ID and the number
     of tasks in the partition */
  rc  = MPI_Init(&argc, &argv);
  rc |= MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  rc |= MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

  /* This program requires exactly NUM_CELLS (here 4) tasks */
  if (numtasks != NUM_CELLS) {
    printf("Error: this program requires %d MPI tasks\n", NUM_CELLS);
    exit(1);
  }
  if (rc != MPI_SUCCESS)
    printf("error initializing MPI and obtaining task ID information\n");
  else
    printf("MPI task ID = %d\n", taskid);
  n = IMAGE_SIZE;

  /* Compute logn and ensure that n is a power of two */
  nx = n;
  logn = 0;
  while ((nx >>= 1) > 0)
    logn++;
  nx = 1;
  for (i = 0; i < logn; i++)
    nx = nx * 2;
  if (nx != n) {
    (void)fprintf(stderr, "%d: fft size must be a power of 2\n", IMAGE_SIZE);
    exit(0);
  }
  /* Initialize the input array: a point source at the center
     (a[256][256] = 512.0 in both parts), zero elsewhere */
  if (taskid == SOURCE_PROCESSOR) {
    for (i = 0; i < n; i++)
      for (j = 0; j < n; j++)
        a[i][j].r = a[i][j].i = 0.0;
    a[n/2][n/2].r = a[n/2][n/2].i = (float)n;

    /* print table headings in anticipation of timing results */
    printf("512 x 512 2D FFT\n");
    printf("                    Timings(secs)\n");
    printf("        scatter 1D-FFT-row transpose 1D-FFT-col gather");
    printf("   total\n");
  }

  /* Precompute the complex constants (twiddle factors) for the 1D FFTs */
  for (i = 0; i < n/2; i++) {
    w_common[i].r = (float)  cos((double)((2.0*PI*i)/(float)n));
    w_common[i].i = (float) -sin((double)((2.0*PI*i)/(float)n));
  }
2D FFT – Code Walkthrough … 3
82
  /* Distribute input matrix by rows */
  rc = MPI_Barrier(MPI_COMM_WORLD);
  if (rc != MPI_SUCCESS) {
    printf("Error: MPI_Barrier() failed with return code %d\n", rc);
    return(-1);
  }
  gettimeofday(&etime[checkpoint++], (struct timeval *)0);

  /* IMAGE_SLICE is the number of rows per process.  Each slice of the
     image is delivered to the corresponding process by MPI_Scatter();
     counts are in MPI_FLOATs, hence the factor of 2 for real+imaginary */
  rc = MPI_Scatter((char *)a, IMAGE_SLICE * IMAGE_SIZE * 2, MPI_FLOAT,
                   (char *)a_slice, IMAGE_SLICE * IMAGE_SIZE * 2, MPI_FLOAT,
                   SOURCE_PROCESSOR, MPI_COMM_WORLD);
  if (rc != MPI_SUCCESS) {
    printf("Error: MPI_Scatter() failed with return code %d\n", rc);
    return(-1);
  }
  gettimeofday(&etime[checkpoint++], (struct timeval *)0);

  /* Perform 1-D row FFTs: a_slice[][] holds this task's rows of the
     image; compute a 1-D FFT over each row of the slice */
  for (i = 0; i < IMAGE_SLICE; i++)
    fft(&a_slice[i][0], w_common, n, logn);
  gettimeofday(&etime[checkpoint++], (struct timeval *)0);
2D FFT – Code Walkthrough … 4
83
2D FFT – Code Walkthrough … 5
84
  /* Transpose 2-D image: break this task's slice into square
     IMAGE_SLICE x IMAGE_SLICE chunks, one per destination task */
  for (cell = 0; cell < NUM_CELLS; cell++) {
    for (i = 0; i < IMAGE_SLICE; i++) {
      for (j = 0; j < IMAGE_SLICE; j++) {
        a_chunks[cell][i][j].r = a_slice[i][j + (IMAGE_SLICE * cell)].r;
        a_chunks[cell][i][j].i = a_slice[i][j + (IMAGE_SLICE * cell)].i;
      }
    }
  }

  /* IMAGE_SLICE * IMAGE_SLICE * 2 floats per chunk (real and imaginary);
     each chunk is delivered to the corresponding process by MPI_Alltoall() */
  rc = MPI_Alltoall(a_chunks, IMAGE_SLICE * IMAGE_SLICE * 2, MPI_FLOAT,
                    b_slice, IMAGE_SLICE * IMAGE_SLICE * 2, MPI_FLOAT,
                    MPI_COMM_WORLD);
  if (rc != MPI_SUCCESS) {
    printf("Error: MPI_Alltoall() failed in cell %d return code %d\n",
           taskid, rc);
    return(-1);
  }
  gettimeofday(&etime[checkpoint++], (struct timeval *)0);
2D FFT – Code Walkthrough … 6
85
  /* Complete the transpose within the received chunks */
  for (i = 0; i < IMAGE_SLICE; i++) {
    for (j = 0; j < IMAGE_SIZE; j++) {
      a_slice[i][j].r = b_slice[j][i].r;
      a_slice[i][j].i = b_slice[j][i].i;
    }
  }

  /* Perform 1-D FFTs (effectively on columns of the original image) */
  for (i = 0; i < IMAGE_SLICE; i++)
    fft(&a_slice[i][0], w_common, IMAGE_SIZE, logn);
  gettimeofday(&etime[checkpoint++], (struct timeval *)0);

  /* Undistribute the output matrix by rows */
  collect = (mycomplex *)malloc(IMAGE_SIZE * IMAGE_SIZE * sizeof(mycomplex));

  /* Every process executes MPI_Gather() */
  rc = MPI_Gather(a_slice, IMAGE_SLICE * IMAGE_SIZE * 2, MPI_FLOAT,
                  a, IMAGE_SLICE * IMAGE_SIZE * 2, MPI_FLOAT,
                  DEST_PROCESSOR, MPI_COMM_WORLD);
  if (rc != MPI_SUCCESS) {
    printf("Error: MPI_Gather() failed with return code %d\n", rc);
    fflush(stdout);
  }
2D FFT – Code Walkthrough … 7
86
  /* On the destination processor, perform another transpose of a[][]
     into b[][] */
  if (taskid == DEST_PROCESSOR) {
    for (i = 0; i < IMAGE_SIZE; i++) {
      for (j = 0; j < IMAGE_SIZE; j++) {
        b[i][j].r = a[j][i].r;
        b[i][j].i = a[j][i].i;
      }
    }
  }
  gettimeofday(&etime[checkpoint++], (struct timeval *)0);
  fflush(stdout);

  /* Calculate event timings and print them */
  for (i = 1; i < checkpoint; i++)
    dt[i] = ((float)((etime[i].tv_sec - etime[i-1].tv_sec) * 1000000 +
                     etime[i].tv_usec - etime[i-1].tv_usec)) / 1000000.0;
  printf("cell %d: ", taskid);
  for (i = 1; i < checkpoint; i++)
    printf("%2.6f ", dt[i]);
  sum = 0;
  for (i = 1; i < checkpoint; i++)
    sum += dt[i];
  printf(" %2.6f \n", sum);
2D FFT – Code Walkthrough … 8
87
  /* Report Mflops and verify the result: a centered point source
     transforms to a checkerboard of +/- n */
  if (taskid == DEST_PROCESSOR) {
    flops = (n*n*logn)*10;
    mflops = ((float)flops/1000000.0);
    mflops = mflops/(float)sum;
    printf("Total Mflops= %3.4f\n", mflops);

    errors = 0;
    for (i = 0; i < n; i++) {
      if (((i+1)/2)*2 == i) sign = 1;
      else sign = -1;
      for (j = 0; j < n; j++) {
        if (b[i][j].r > n*sign+EPSILON || b[i][j].r < n*sign-EPSILON ||
            b[i][j].i > n*sign+EPSILON || b[i][j].i < n*sign-EPSILON) {
          printf("[%d][%d] is %f,%f should be %f\n", i, j,
                 b[i][j].r, b[i][j].i, (float)n*sign);
          errors++;
        }
        sign *= -1;
      }
    }
    if (errors) {
      printf("%d errors!!!!!\n", errors);
      exit(0);
    }
  }
  MPI_Finalize();
  exit(0);
}
2D FFT – Code Walkthrough … 9
88
void fft(mycomplex *data, mycomplex *w_common, int n, int logn)
{
  int incrvec, i0, i1, i2;
  float f0, f1;
  void bit_reverse();

  /* bit-reverse the input vector */
  (void)bit_reverse(data, n);

  /* do the first logn-1 stages of the fft */
  i2 = logn;
  for (incrvec = 2; incrvec < n; incrvec <<= 1) {
    i2--;
    for (i0 = 0; i0 < incrvec >> 1; i0++) {
      for (i1 = 0; i1 < n; i1 += incrvec) {
        f0 = data[i0+i1 + incrvec/2].r * w_common[i0<<i2].r -
             data[i0+i1 + incrvec/2].i * w_common[i0<<i2].i;
        f1 = data[i0+i1 + incrvec/2].r * w_common[i0<<i2].i +
             data[i0+i1 + incrvec/2].i * w_common[i0<<i2].r;
        data[i0+i1 + incrvec/2].r = data[i0+i1].r - f0;
        data[i0+i1 + incrvec/2].i = data[i0+i1].i - f1;
        data[i0+i1].r = data[i0+i1].r + f0;
        data[i0+i1].i = data[i0+i1].i + f1;
      }
    }
  }
2D FFT – Code Walkthrough … 10
89
  /* do the last stage of the fft */
  for (i0 = 0; i0 < n/2; i0++) {
    f0 = data[i0 + n/2].r * w_common[i0].r - data[i0 + n/2].i * w_common[i0].i;
    f1 = data[i0 + n/2].r * w_common[i0].i + data[i0 + n/2].i * w_common[i0].r;
    data[i0 + n/2].r = data[i0].r - f0;
    data[i0 + n/2].i = data[i0].i - f1;
    data[i0].r = data[i0].r + f0;
    data[i0].i = data[i0].i + f1;
  }
}

/* bit_reverse - simple (but somewhat inefficient) bit reverse */
void bit_reverse(mycomplex *a, int n)
{
  int i, j, k;

  j = 0;
  for (i = 0; i < n-2; i++) {
    if (i < j) {
      SWAP(a[j], a[i]);
    }
    k = n >> 1;
    while (k <= j) {
      j -= k;
      k >>= 1;
    }
    j += k;
  }
}
FFT Header File
90
/***************************************************************************
 * FILE: mpi_2dfft.h
 * DESCRIPTION: see mpi_2dfft.c
 * AUTHOR: George Gusciora
 * LAST REVISED:
 ***************************************************************************/
#define MAXN 2048            /* max 2d fft size */
#define EPSILON 0.00001      /* for comparing fp numbers */
#define PI 3.14159265358979  /* 4*atan(1.0) */

typedef struct { float r, i; } mycomplex;

/* swap a pair of complex numbers */
#define SWAP(a,b) {float swap_temp=(a).r;(a).r=(b).r;(b).r=swap_temp;\
                   swap_temp=(a).i;(a).i=(b).i;(b).i=swap_temp;}

/* swap a pair of floats */
#define MYSWAP(a,b) {float swap_temp=a;a=b;b=swap_temp;}
91
Topics
• Introduction
• Midterm Exam Review
• Matrix Multiplication
• N-Body Problem
• Fast Fourier Transform (FFT)
• Summary – Materials for Test
92
Summary – Material for the Test
• Introduction – Slides: 4, 5, 6
• Matrix Multiply basic algorithm – Slides: 49 – 54
• N-body –
• FFT –
93