Page 1: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

High Performance Computing: Concepts, Methods & Means

Parallel Algorithms 2

Prof. Thomas SterlingDepartment of Computer Science

Louisiana State University

March 6, 2007

Page 2: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

2

Topics

• Introduction

• Midterm Exam Review

• Matrix Multiplication

• N-Body Problem

• Fast Fourier Transform (FFT)

• Summary – Materials for Test

Page 3: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

3

Topics

• Introduction
• Midterm Exam Review

• Matrix Multiplication

• N-Body Problem

• Fast Fourier Transform (FFT)

• Summary – Materials for Test

Page 4: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Half-Way Through (almost)

• More of the same
• Today: some basic algorithms (in MPI)

  – Matrix-matrix multiply
  – N-body
  – FFT

• But first: a brief walk-through in preparation for the midterm exam! (Good luck!)

Page 5: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

5

Topics

• Introduction

• Midterm Exam Review
• Matrix Multiplication

• N-Body Problem

• Fast Fourier Transform (FFT)

• Summary – Materials for Test

Page 6: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

How to Prepare for Midterm

• Closed-book exam
• Will look like a problem set (use as a template)
• Study aids:
  – Summary slide at the end of each lecture
  – Problem sets
• Will emphasize:
  – Basic knowledge
  – Skills
  – Performance models
• Note:
  – To be held in room 338 Johnston Hall
  – 1 hour 15 minutes
  – Bring a calculator (know how to use it)

Page 7: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

HPC in Overview (1st half)

• Supercomputing evolves as an interplay of:
  – Device technology
  – Computer architecture
  – Execution models and programming methods
• Three classes of parallel computing:
  – Capacity
  – Cooperative
  – Capability
• Three execution models:
  – Throughput
  – Shared-memory multithreaded
  – Communicating sequential processes (message passing)
• Three programming formalisms:
  – Condor
  – OpenMP
  – MPI
• Performance modeling and measurement:
  – Metrics
  – Models
  – Measurement tools

Page 8: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

S1 – L3: Benchmarking

• Basic performance metrics (slide 4)
• Definition of a benchmark in your own words; purpose of benchmarking; properties of a good benchmark (slides 5, 6, 7)
• Linpack: what it is, what it measures, concepts and complexities (slides 15, 17, 18)
• HPL: algorithms and concepts (slides 21 through 24)
• Linpack compare and contrast (slide 25)
• General knowledge about the HPCC and NPB suites (slides 31, 34, 35)
• Benchmark result interpretation (slides 49, 50)

8

Page 9: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

S1 – L4: Capacity Computing

• Understand material on slides (4, 5), (7, 8)
• Understand the example detailed in slides 17, 18
• Understand slide 19 and be able to derive slides (20, 21), (22, 23)
• Understand the Condor concepts detailed in slides 30, 31, 32
• Condor commands (slides 37-47): know what the basic commands are, what they do, and how to interpret the output they present (no need to memorize command-line options)
• Understand the issues listed on slide 53
• Required reading material:
  – http://www.cct.lsu.edu/~cdekate/7600/beowulf-chapter-rev1.pdf
  – Specific pages to focus on: 3-16

Page 10: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

10

S2 – L1 : Architecture

• Need to know content on slides 11, 15, 22, 23, 33
• Understand how each of the technologies listed on slide 7 affects performance
• Understand concepts on slides 8, 9
• Understand concepts on slides 17, 18, 20
• Understand pipelining concepts and equations detailed in slides 27, 28
• Understand vector processing concepts and equations detailed in slides 29, 30

Page 11: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

11

S2 – L2 : SMP

• Please make sure that you have addressed all points outlined on slide 5

• Understand content on slide 7
• Understand concepts, equations, and problems on slides 11, 12, 13
• Understand content on slides 21, 24, 26, 29
• Understand concepts on slides 32, 33
• Understand content on slides 36, 55
• Required reading material:
  http://arstechnica.com/articles/paedia/hardware/pcie.ars/1

Page 12: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

S2 – L3 : PThreads

• Performance & cpi: slide 8• Multi thread concepts: 13, 16, 18, 19, 22, 24, 31• Thread implementations: 35 – 37• Pthreads: 43 – 45, 48, 55

Page 13: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

S2 – L4 : OpenMP

• Components: 6
• Compiling: 9, 12
• Environment variables: 13, 14
• Top level: 15
• Shared data: 18, 19, 20
• Parallel flow control: 23, 24, 25
• Synchronization: 32, 34
• Performance: 39
• Synopsis: 44

Page 14: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

S2 – L5 : Performance 2

• Measuring system operation slides: 11, 13, 17
• Gprof slides: 21, 22
• Perfsuite slides: 25, 29
• PAPI slides: 33 – 36 (inclusive)
• Tau slides: 56 – 60 (inclusive)

Page 15: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

15

S3 – L1 : Communicating Sequential Processes

• Basics: slides 6 – 9, 16

• CSP: slides 19

• Unix: slides 24, 28 - 30

Page 16: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

S3 – L2 : MPI

• MPI standard: slides 4, 7
• Compile and run an MPI program: slides 10, 11
• Environment functions: slides 12, 14
• Point-to-point functions: slides 27, 28
• Blocking vs. nonblocking: slides 25, 26
• Deadlock: slides 29-31
• Basic collective functions: slides 33, 34, 36, 38, 40, 41, 43

Page 17: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

17

S3 – L3 : Performance 3

• Essential MPI – Slide: 9
• Performance models – Slides: 12, 15, 16, 18 (Hockney)
• LogP – Slides: 20 – 23
• Effective bandwidth – Slide: 30
• Tau/MPI – Slides: 41, 43

Page 18: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

18

S3 – L4 : Parallel Algorithms

• Introduction – Slides: 4, 5, 6
• Array decomposition – Slides: 11, 12
• Mandelbrot load balancing – Slides: 25, 26
• Monte Carlo, creating communicators – Slides: 40, 42

Page 19: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

System-Level Overview

• Understand the 3 classes of parallel computing (capacity, cooperative, capability).
• Software system
  – Understand the software stack (e.g., OS, compilers, …) used in various supercomputers
  – Conceptual understanding of different parallel programming models (e.g., shared memory, message passing, …), and the advantages and disadvantages of each
• Computer architecture
  – Understand and be able to discuss different sources of performance degradation (latency, overhead, etc.)
  – Understand Amdahl's Law and be able to solve problems related to it, as well as scalability, efficiency, and cpi
  – Understand and be able to describe the different forms of hardware parallelism (pipelining, ILP, multiprocessors (SIMD, MIMD), etc.)
  – Understand the numerical problems provided in the Section 1 Problem Sets (1, 2, 3) and the associated equations and theory behind them

Page 20: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Execution Models

• Throughput execution model (e.g., Condor)
  – Be aware of the various Condor commands
  – Thoroughly understand core Condor concepts (e.g., ClassAds and matchmaking), and how these concepts work together
• Shared-memory multithreaded (e.g., OpenMP)
  – Understand sources of contention (race conditions) and how to resolve them (critical sections, etc.)
  – Understand the various OpenMP constructs and how they work (e.g., be able to answer questions like how and when to use the "critical" construct and its performance implications)
  – Understand the concept of shared, private, and reduction variables
  – Be able to read and understand simple OpenMP code (C) and be able to make conceptual changes where and when asked

20

Page 21: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Execution Models and Performance

• Communicating Sequential Processes
  – Conceptual understanding of CSP
  – Know the meaning and common usage of the various MPI constructs
  – Understand fundamental concepts like deadlock and how to resolve them
  – Be able to read a small code snippet and correct conceptual (NOT syntactical) problems; you do not need to memorize the syntax of MPI constructs
• Performance & benchmarking
  – Be aware of the Top500 list and the benchmarks used
  – Be aware of the different benchmarks and what each of them stresses (Linpack, HPL, the different components of HPCC, …)
  – Be aware of the different performance tools discussed in class and what they measure
  – Understand and be able to solve problems related to the LogP model

21

Page 22: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Key Terms and Concepts

• Speedup: relative reduction of the execution time of a fixed-size workload through parallel execution

  Speedup = (execution time on one processor) / (execution time on N processors)

• Efficiency: ratio of the actual performance to the best possible performance

  Efficiency = (execution time on one processor) / (execution time on multiple processors × number of processors)

22

Page 23: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Ideal Speedup Example

23

Workload: W = 2^20 operations, divided into 2^10 tasks w_1 … w_{2^10}, each of 2^10 operations (W = Σ_i w_i; units: steps).

Processors: P = 2^8, so each processor executes 2^2 tasks, giving

  T(1) = 2^20
  T(2^8) = 2^12

  Speedup = T(1) / T(2^8) = 2^20 / 2^12 = 2^8
  Efficiency = 2^20 / (2^12 × 2^8) = 2^0 = 1

Page 24: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Ideal Speedup Issues

24

• W is total workload measured in elemental pieces of work (e.g. operations, instructions, etc.)

• T(p) is total execution time measured in elemental time steps (e.g. clock cycles) where p is # of execution sites (e.g. processors, threads)

• w_i is the work for a given task i

• Example: here we divide a million-operation (really mega, i.e. 2^20) workload, W, into a thousand (1024) tasks, w_1 to w_1024, each of 1 K (1024) operations

• Assume 256 processors performing workload in parallel

• T(256) = 4096 steps, speedup = 256, Eff = 1

Page 25: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Amdahl's Law

Definitions:
  T_O = time for the non-accelerated computation
  T_F = time of the portion of the computation that can be accelerated
  T_A = time for the accelerated computation
  g   = peak performance gain for the accelerated portion of the computation
  f   = fraction of the non-accelerated computation to be accelerated
  S   = speedup of the computation with acceleration applied

  f = T_F / T_O
  T_A = T_O − T_F + T_F / g = T_O (1 − f) + T_O f / g
  S = T_O / T_A = 1 / (1 − f + f / g)

(The slide's timeline figure shows the original run of length T_O containing the segment T_F, and the accelerated run of length T_A in which T_F is replaced by T_F / g.)
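As a quick worked example (the numbers here are invented for illustration, not taken from the slides): if 90% of the computation can be accelerated by a factor of 10, then

$$S = \frac{1}{(1-f) + f/g} = \frac{1}{0.1 + 0.09} \approx 5.3 \qquad (f = 0.9,\ g = 10)$$

so even a 10x accelerator yields only about a 5x overall speedup, because the unaccelerated 10% dominates.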

Page 26: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Overhead

26

v   = overhead
w   = work unit
W   = total work
T_i = execution time with i processors
P   = # processors

W = Σ_{i=1..P} w_i

(Figure: the workload is split into four equal work units w, each carrying overhead v, so W = 4v + 4w.)

T_P = W/P + v

S = T_1 / T_P = W / (W/P + v) = P / (1 + P·v/W)

Assumption: the workload is infinitely divisible.

Page 27: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Scalability & Overhead

27

J = # tasks = W / w_g

T_1(W) = W + v ≈ W    (when W >> v)

T_P(W) = (J/P)(w_g + v) = (W / (P·w_g))(w_g + v) = (W/P)(1 + v/w_g)

S = T_1 / T_P = W / ((W/P)(1 + v/w_g)) = P / (1 + v/w_g)

v   = overhead
w_g = work unit
W   = total work
T_i = execution time with i processors
P   = # processors
J   = # tasks

Page 28: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Scalability and Overhead for fixed-size work tasks

28

• W is divided into J tasks of size w_g

• Each task requires v overhead work to manage
• For P processors there are approximately J/P tasks to be performed in sequence, so

• T_P = J(w_g + v) / P

• Note that S = T_1 / T_P
• So, S = P / (1 + v / w_g)
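A quick numeric check with invented values: with P = 256 processors and per-task overhead equal to one tenth of the task size (v/w_g = 0.1),

$$S = \frac{P}{1 + v/w_g} = \frac{256}{1.1} \approx 233$$

so the overhead caps the achievable speedup well below the processor count.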

Page 29: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Measuring LogP Parameters

• Finding L+2*o

– Proc 0: (MPI_Send() then MPI_Recv()) x N

– Proc 1: (MPI_Recv() then MPI_Send()) x N

– L+2*o = total time/N

Figure 1: Time diagram for benchmark 1. (a) Time diagram of processor 0; (b) time diagram of processor 1.
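A minimal sketch in C/MPI of this round-trip benchmark; the iteration count N and the placement of the timers are choices made for the illustration, not taken from the course code, and the printed value follows the slide's model (total time / N estimates L + 2*o):

/* Hedged sketch: N ping-pong iterations between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>

#define N 10000

int main(int argc, char *argv[])
{
    int rank, i, msg = 0;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < N; i++) {
        if (rank == 0) {                 /* Proc 0: MPI_Send() then MPI_Recv() */
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {          /* Proc 1: MPI_Recv() then MPI_Send() */
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("estimated L + 2*o = %g seconds (per the slide's model)\n", (t1 - t0) / N);

    MPI_Finalize();
    return 0;
}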

Page 30: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Measuring LogP Parameters

• Finding o

– Proc 0: (MPI_Send() then some_work then MPI_Recv() ) x N

– Proc 1: (MPI_Recv() then MPI_Send() then some_work) x N

– o = (1/2)total time/N – time(some_work)

– requires time(some_work) > 2*L+2*o

Figure 2: Time diagram for benchmark 2 with X > 2*L + O_r + O_s. (a) Time diagram of processor 1; (b) time diagram of processor 2.

Page 31: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Performance Metrics

• Peak floating-point operations per second (flops)
• Peak instructions per second (ips)
• Sustained throughput
  – flops, Mflops, Gflops, Tflops, Pflops
  – flops, Megaflops, Gigaflops, Teraflops, Petaflops
  – ips, Mips, …
• Cycles per instruction
  – cpi
  – Alternatively: instructions per cycle, ipc
• Memory access latency
  – cycles (or seconds)
• Memory access bandwidth
  – bytes per second
  – or Gigabytes per second, GBps, GB/s
• Bisection bandwidth
  – bytes per second

31

Page 32: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

32

CPI (continued)

T = #I × cpi × t_cycle

cpi = r × cpi_r + m × cpi_m

cpi_m = m_hit × cpi_m-hit + m_miss × cpi_m-miss

r = #I_r / #I ,  m = #I_m / #I

T = #I × (r × cpi_r + m × cpi_m) × t_cycle

where r + m = 1.0

(Here #I is the total instruction count, r and m are the fractions of register/ALU and memory-access instructions, and m_hit, m_miss are the cache hit and miss fractions.)
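For instance, with illustrative numbers (not from the lecture): a register-instruction fraction r = 0.8 with cpi_r = 1 and a memory-instruction fraction m = 0.2 with an average cpi_m = 12 give

$$cpi = r\,cpi_r + m\,cpi_m = 0.8 \times 1 + 0.2 \times 12 = 3.2, \qquad T = \#I \times 3.2 \times t_{cycle}$$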

Page 33: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

33

Basic Performance Metrics

• Time related:
  – Execution time [seconds]
    • wall clock time
    • system and user time
  – Latency
  – Response time
• Rate related:
  – Rate of computation
    • floating point operations per second [flops]
    • integer operations per second [ops]
  – Data transfer (I/O) rate [bytes/second]
• Effectiveness:
  – Efficiency [%]
  – Memory consumption [bytes]
  – Productivity [utility/($*second)]
• Modifiers:
  – Sustained
  – Peak
  – Theoretical peak

Page 34: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

34

Basic Parallel (MPI) Program Steps

• Establish logical bindings
• Initialize application execution environment
• Distribute data and work
• Perform core computations in parallel (across nodes)
• Synchronize and exchange intermediate data results
  – Optional for non-embarrassingly-parallel (cooperative) codes
• Detect "stop" condition
  – May be implicit with a barrier, etc.
• Aggregate final results
  – Often a reduction operation
• Output results and error code
• Terminate and return to OS
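Purely as an illustration of these steps, here is a minimal MPI skeleton in C; the toy workload (summing rank IDs) and the use of a reduction to aggregate results are assumptions made for the sketch, not part of the lecture's codes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    long local, total = 0;

    MPI_Init(&argc, &argv);                   /* initialize application execution environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* establish logical bindings */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = rank + 1;                         /* "distribute data and work": a trivial slice per rank */

    /* core computation in parallel would go here */

    MPI_Barrier(MPI_COMM_WORLD);              /* synchronize; the "stop" condition is implicit here */

    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);  /* aggregate final results */

    if (rank == 0)
        printf("sum over %d processes = %ld\n", size, total);             /* output results */

    MPI_Finalize();                           /* terminate and return to OS */
    return 0;
}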

Page 35: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

The Essential MPI

• API elements:
  – MPI_Init(), MPI_Finalize()
  – MPI_Comm_size(), MPI_Comm_rank()
  – MPI_COMM_WORLD
  – Error checking using MPI_SUCCESS
  – MPI basic data types (slide 27)
  – Blocking: MPI_Send(), MPI_Recv()
  – Non-blocking: MPI_Isend(), MPI_Irecv(), MPI_Wait()
  – Collective calls: MPI_Barrier(), MPI_Bcast(), MPI_Gather(), MPI_Scatter(), MPI_Reduce()
• Commands:
  – Running MPI programs: mpirun
  – Compile (C): mpicc
  – Compile (Fortran 77): mpif77

Page 36: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

36

Topics

• Introduction

• Midterm Exam Review

• Matrix Multiplication
• N-Body Problem

• Fast Fourier Transform (FFT)

• Summary – Materials for Test

Page 37: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

37

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.

Matrices — A Review

An n × m matrix

Page 38: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

38

Matrix Multiplication

Multiplication of two matrices, A and B, produces the matrix C whose elements, c_{i,j} (0 <= i < n, 0 <= j < m), are computed as follows:

  c_{i,j} = Σ_{k=0}^{l-1} a_{i,k} × b_{k,j}

where A is an n × l matrix and B is an l × m matrix.

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.

Page 39: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

39

Matrix multiplication, C = A x B

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.

Page 40: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

40

Implementing Matrix MultiplicationSequential Code

Assume throughout that the matrices are square (n × n matrices). The sequential code to compute A × B could simply be

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        c[i][j] = 0;
        for (k = 0; k < n; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }

This algorithm requires n^3 multiplications and n^3 additions, leading to a sequential time complexity of O(n^3). Very easy to parallelize.

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.

Page 41: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

41

Block Matrix Multiplication

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.

Page 42: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

42Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.

Page 43: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

43

Performance Improvement

Using tree construction, n numbers can be added in log n steps using n processors:

Computational time complexity of O(log n) using n^3 processors.

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen, @ 2004 Pearson Education Inc. All rights reserved.
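One hedged sketch of such tree-style combining in MPI follows; the explicit pairwise exchange is illustrative only (in practice a single MPI_Reduce() call gives the same log-step combining internally), and the values being summed are invented for the example:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, step, partial, other;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    partial = rank + 1;                            /* each rank contributes one number */

    /* pairwise (tree) combining: roughly log2(P) levels of message exchange */
    for (step = 1; step < size; step <<= 1) {
        if (rank % (2 * step) == step) {           /* sender at this level, then done */
            MPI_Send(&partial, 1, MPI_INT, rank - step, 0, MPI_COMM_WORLD);
            break;
        } else if (rank % (2 * step) == 0 && rank + step < size) {
            MPI_Recv(&other, 1, MPI_INT, rank + step, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            partial += other;                      /* combine the pair */
        }
    }

    if (rank == 0)
        printf("tree sum = %d (expected %d)\n", partial, size * (size + 1) / 2);

    MPI_Finalize();
    return 0;
}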

Page 44: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

44

Flowchart for Matrix Multiplication

"master":
  Initialize MPI environment
  Initialize array
  Partition array into workloads
  Send workload to "workers"
  Wait for "workers" to finish task
  Recv. results
  Print results
  End

"workers" (each):
  Initialize MPI environment
  Recv. work
  Calculate matrix product
  Send result

Page 45: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

45

Matrix Multiplication (source code)

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define NRA 62                 /* number of rows in matrix A */
#define NCA 15                 /* number of columns in matrix A */
#define NCB 7                  /* number of columns in matrix B */
#define MASTER 0               /* taskid of first task */
#define FROM_MASTER 1          /* setting a message type */
#define FROM_WORKER 2          /* setting a message type */

int main(argc, argv)
int argc;
char *argv[];
{
int numtasks,                  /* number of tasks in partition */
    taskid,                    /* a task identifier */
    numworkers,                /* number of worker tasks */
    source,                    /* task id of message source */
    dest,                      /* task id of message destination */
    mtype,                     /* message type */
    rows,                      /* rows of matrix A sent to each worker */
    averow, extra, offset,     /* used to determine rows sent to each worker */
    i, j, k, rc;               /* misc */
double a[NRA][NCA],            /* matrix A to be multiplied */
       b[NCA][NCB],            /* matrix B to be multiplied */
       c[NRA][NCB];            /* result matrix C */

MPI_Status status;

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

Page 46: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

46

Matrix Multiplication (source code)/* Initialize MPI Environment */

MPI_Init(&argc,&argv);MPI_Comm_rank(MPI_COMM_WORLD,&taskid);MPI_Comm_size(MPI_COMM_WORLD,&numtasks);if (numtasks < 2 ) { printf("Need at least two MPI tasks. Quitting...\n"); MPI_Abort(MPI_COMM_WORLD, rc); exit(1); }numworkers = numtasks-1;

/* Master block*/ if (taskid == MASTER) { printf("mpi_mm has started with %d tasks.\n",numtasks); printf("Initializing arrays...\n"); for (i=0; i<NRA; i++) for (j=0; j<NCA; j++) a[i][j]= i+j; /* Initialize array a */ for (i=0; i<NCA; i++) for (j=0; j<NCB; j++) b[i][j]= i*j; /* Initialize array b */ /* Send matrix data to the worker tasks */ averow = NRA/numworkers; /* determining fraction of array to be processed by “workers” */ extra = NRA%numworkers; offset = 0; mtype = FROM_MASTER; /* Message Tag */

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

Page 47: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

47

Matrix Multiplication (source code) for (dest=1; dest<=numworkers; dest++) { /* To each worker send : Start point, number of rows to process, and sub-arrays to process */ rows = (dest <= extra) ? averow+1 : averow; printf("Sending %d rows to task %d offset=%d\n",rows,dest,offset); MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD); MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD); MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD); MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD); offset = offset + rows; }

/* Receive results from worker tasks */ mtype = FROM_WORKER; /* Message tag for messages sent by “workers” */ for (i=1; i<=numworkers; i++) { source = i;

/* offset stores the (processing) starting point of work chunk */ MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status); MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);

/* The array C contains the product of sub-array A and the array B */ MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status); printf("Received results from task %d\n",source); } printf("******************************************************\n"); printf("Result Matrix:\n"); for (i=0; i<NRA; i++) { printf("\n"); for (j=0; j<NCB; j++) printf("%6.2f ", c[i][j]); } printf("\n******************************************************\n"); printf ("Done.\n"); }

Page 48: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

48

Matrix Multiplication (source code)

/**************************** worker task ************************************/ if (taskid > MASTER) { mtype = FROM_MASTER; MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status); MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status); MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status); MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

for (k=0; k<NCB; k++) for (i=0; i<rows; i++) { c[i][k] = 0.0; for (j=0; j<NCA; j++)

/* Calculate the product and store result in C */ c[i][k] = c[i][k] + a[i][j] * b[j][k]; } mtype = FROM_WORKER; MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD); MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);

/* Worker sends the resultant array to the master */ MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD); } MPI_Finalize();}

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

Page 49: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

49

Demo : Matrix Multiplication[cdekate@compute-0-6 matrix_multiplication]$ mpirun -np 4 ./mpi_mmmpi_mm has started with 4 tasks.Initializing arrays...Sending 21 rows to task 1 offset=0Sending 21 rows to task 2 offset=21Sending 20 rows to task 3 offset=42Received results from task 1Received results from task 2Received results from task 3******************************************************Result Matrix:

0.00 1015.00 2030.00 3045.00 4060.00 5075.00 6090.00 0.00 1120.00 2240.00 3360.00 4480.00 5600.00 6720.00 0.00 1225.00 2450.00 3675.00 4900.00 6125.00 7350.00 0.00 1330.00 2660.00 3990.00 5320.00 6650.00 7980.00 0.00 1435.00 2870.00 4305.00 5740.00 7175.00 8610.00 0.00 1540.00 3080.00 4620.00 6160.00 7700.00 9240.00 0.00 1645.00 3290.00 4935.00 6580.00 8225.00 9870.00 ……… 0.00 6475.00 12950.00 19425.00 25900.00 32375.00 38850.00 0.00 6580.00 13160.00 19740.00 26320.00 32900.00 39480.00 0.00 6685.00 13370.00 20055.00 26740.00 33425.00 40110.00 0.00 6790.00 13580.00 20370.00 27160.00 33950.00 40740.00 0.00 6895.00 13790.00 20685.00 27580.00 34475.00 41370.00 0.00 7000.00 14000.00 21000.00 28000.00 35000.00 42000.00 0.00 7105.00 14210.00 21315.00 28420.00 35525.00 42630.00 0.00 7210.00 14420.00 21630.00 28840.00 36050.00 43260.00 0.00 7315.00 14630.00 21945.00 29260.00 36575.00 43890.00 0.00 7420.00 14840.00 22260.00 29680.00 37100.00 44520.00 ******************************************************Done.[cdekate@compute-0-6 matrix_multiplication]$

Page 50: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

50

Topics

• Introduction

• Midterm Exam Review

• Matrix Multiplication

• N-Body Problem
• Fast Fourier Transform (FFT)

• Summary – Materials for Test

Page 51: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

51

N Bodies

OU Supercomputing Center for Education & Research

Page 52: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

52
OU Supercomputing Center for Education & Research
Img src: http://www.lsbu.ac.uk/water

N-Body Problems

An N-body problem is a problem involving N “bodies” – that is, particles (e.g., stars, atoms) – each of which applies a force to all of the others.

For example, if you have N stars, then each of the N stars exerts a force (gravity) on all of the other N–1 stars.

Likewise, if you have N atoms, then every atom exerts a force on all of the other N–1 atoms. The forces are Coulombic and van der Waals forces.

Page 53: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

53

2-Body Problem

When N is 2, you have – surprise! – a 2-Body Problem: exactly two particles, each exerting a force that acts on the other.

The relationship between the 2 particles can be expressed as a differential equation that can be solved analytically, producing a closed-form solution.

So, given the particles’ initial positions and velocities, you can immediately calculate their positions and velocities at any later time.

OU Supercomputing Center for Education & Research

Page 54: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

54

N-Body Problems

For N of 3 or more, no one knows how to solve the equations to get a closed form solution.

So, numerical simulation is pretty much the only way to study groups of 3 or more bodies.

Popular applications of N-body codes include astronomy and chemistry.

Note that, for N bodies, there are on the order of N^2 forces, denoted O(N^2).

OU Supercomputing Center for Education & Research

Page 55: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

55

N-Body Problems

Given N bodies, each body exerts a force on all of the other N–1 bodies.

Therefore, there are N • (N–1) forces in total.

You can also think of this as (N • (N–1))/2 forces, in the sense that the force from particle A to particle B is the same (except in the opposite direction) as the force from particle B to particle A.

OU Supercomputing Center for Education & Research

Page 56: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

56

N-Body Problems

Given N bodies, each body exerts a force on all of the other N–1 bodies.

Therefore, there are N • (N–1) forces in total.

In Big-O notation, that's O(N^2) forces to calculate.

So, calculating the forces takes O(N^2) time to execute.

But, there are only N particles, each taking up the same amount of memory, so we say that N-body codes are of:

• O(N) spatial complexity (memory)
• O(N^2) time complexity

OU Supercomputing Center for Education & Research

Page 57: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

57

O(N^2) Forces

Note that this picture shows only the forces between A and everyone else.

A

OU Supercomputing Center for Education & Research

Page 58: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

58

How to Calculate?

Whatever your physics is, you have some function, F(A,B), that expresses the force between two bodies A and B.

For example,

F(A,B) = G · m_A · m_B / dist(A,B)^2

where G is the gravitational constant and m is the mass of the particle in question.

If you have all of the forces for every pair of particles, then you can calculate their sum, obtaining the force on every particle.

OU Supercomputing Center for Education & Research
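As a small illustration of such an F(A,B) for the gravitational case, a possible sketch in C is shown below; the Body struct, the constant value, and the driver values are assumptions made for this example, not taken from the course code:

#include <stdio.h>

/* Hypothetical particle record and constant, invented for this sketch. */
typedef struct { double x, y, z, mass; } Body;
#define G 6.674e-11                               /* gravitational constant (SI units) */

/* Magnitude of the gravitational force F(A,B) = G * m_A * m_B / dist(A,B)^2 */
double force_magnitude(const Body *a, const Body *b)
{
    double dx = b->x - a->x, dy = b->y - a->y, dz = b->z - a->z;
    double dist2 = dx * dx + dy * dy + dz * dz;   /* squared distance */
    return G * a->mass * b->mass / dist2;         /* caller guarantees dist2 > 0 */
}

int main(void)
{
    Body a = { 0.0, 0.0, 0.0, 5.0e24 };           /* made-up masses and positions */
    Body b = { 1.0e7, 0.0, 0.0, 7.0e22 };
    printf("F = %g N\n", force_magnitude(&a, &b));
    return 0;
}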

Page 59: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

59

How to Parallelize?

Okay, so let’s say you have a nice serial (single-CPU) code that does an N-body calculation.

How are you going to parallelize it? You could:

• have a master feed particles to processes;
• have a master feed interactions to processes;
• have each process decide on its own subset of the particles, and then share around the forces;
• have each process decide on its own subset of the interactions, and then share around the forces.

OU Supercomputing Center for Education & Research

Page 60: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

60

Do You Need a Master?

Let's say that you have N bodies, and therefore you have ½N(N–1) interactions (every particle interacts with all of the others, but you don't need to calculate both A→B and B→A).

Do you need a master?

Well, can each processor determine on its own either (a) which of the bodies to process, or (b) which of the interactions?

If the answer is yes, then you don’t need a master.

OU Supercomputing Center for Education & Research
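A hedged sketch of option (a), with each rank deriving its own contiguous block of bodies from nothing but its rank and the process count (the names and the body count are invented for the illustration):

#include <mpi.h>
#include <stdio.h>

#define NBODIES 4000            /* total number of bodies; value chosen only for the example */

int main(int argc, char *argv[])
{
    int rank, size, chunk, first, last;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    chunk = (NBODIES + size - 1) / size;                          /* ceiling division */
    first = rank * chunk;                                         /* first body owned by this rank */
    last  = (first + chunk < NBODIES) ? first + chunk : NBODIES;  /* one past the last body */

    /* no master needed: every rank derived its range from its own rank number */
    printf("rank %d owns bodies [%d, %d)\n", rank, first, last);

    MPI_Finalize();
    return 0;
}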

Page 61: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

61

N-Body “Pipeline” Implementation Flowchart

Create ring communicator

Initialize particle parameters

Copy local particle data to send buffer

Update positions of local particles

All iterations done?

Finalize MPI

N

Y

Initiate transmission of send buffer to the RIGHT neighbor in ring

Initiate reception of data from the LEFT neighbor in ring

Compute forces between local and send buffer particles

Processed particles from all remote nodes?

N

Wait for message exchange to complete

Copy particle data from receive buffer to send buffer

Y

Initialize MPI environment

Page 62: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

62

N-Body (source code)

#include "mpi.h"#include <stdlib.h>#include <stdio.h>#include <string.h>#include <math.h>

/* Pipeline version of the algorithm... *//* we really need the velocities as well… */

/* Simplified structure describing parameters of a single particle */typedef struct { double x, y, z; double mass; } Particle;/* We use leapfrog for the time integration ... */

/* Structure to hold force components and old position coordinates of a particle */typedef struct { double xold, yold, zold; double fx, fy, fz; } ParticleV;

void InitParticles( Particle[], ParticleV [], int );double ComputeForces( Particle [], Particle [], ParticleV [], int );double ComputeNewPos( Particle [], ParticleV [], int, double, MPI_Comm );

#define MAX_PARTICLES 4000#define MAX_P 128

Page 63: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

63

N-Body (source code)

main( int argc, char *argv[] ){ Particle particles[MAX_PARTICLES]; /* Particles on ALL nodes */ ParticleV pv[MAX_PARTICLES]; /* Particle velocity */ Particle sendbuf[MAX_PARTICLES], /* Pipeline buffers */

recvbuf[MAX_PARTICLES]; MPI_Request request[2]; int counts[MAX_P], /* Number on each processor */ displs[MAX_P]; /* Offsets into particles */ int rank, size, npart, i, j,

offset; /* location of local particles */ int totpart, /* total number of particles */

cnt; /* number of times in loop */ MPI_Datatype particletype; double sim_t; /* Simulation time */ double time; /* Computation time */ int pipe, left, right, periodic; MPI_Comm commring; MPI_Status statuses[2];

/* Initialize MPI Environment */ MPI_Init( &argc, &argv ); MPI_Comm_rank( MPI_COMM_WORLD, &rank ); MPI_Comm_size( MPI_COMM_WORLD, &size );

/* Create 1-dimensional periodic Cartesian communicator (a ring) */ periodic = 1; MPI_Cart_create( MPI_COMM_WORLD, 1, &size, &periodic, 1, &commring ); MPI_Cart_shift( commring, 0, 1, &left, &right ); /* Find the closest neighbors in ring */

Page 64: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

/* Calculate local fraction of particles */ if (argc < 2) {

fprintf( stderr, "Usage: %s n\n", argv[0] );MPI_Abort( MPI_COMM_WORLD, 1 );

} npart = atoi(argv[1]) / size; if (npart * size > MAX_PARTICLES) {

fprintf( stderr, "%d is too many; max is %d\n", npart*size, MAX_PARTICLES );MPI_Abort( MPI_COMM_WORLD, 1 );

} MPI_Type_contiguous( 4, MPI_DOUBLE, &particletype ); /* Data type corresponding to Particle struct */ MPI_Type_commit( &particletype );

/* Get the sizes and displacements */ MPI_Allgather( &npart, 1, MPI_INT, counts, 1, MPI_INT, commring ); displs[0] = 0; for (i=1; i<size; i++)

displs[i] = displs[i-1] + counts[i-1]; totpart = displs[size-1] + counts[size-1];

/* Generate the initial values */ InitParticles( particles, pv, npart); offset = displs[rank]; cnt = 10; time = MPI_Wtime(); sim_t = 0.0;

/* Begin simulation loop */ while (cnt--) {

double max_f, max_f_seg;

64

N-Body (source code)

Page 65: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

65

N-Body (source code)/* Load the initial send buffer */memcpy( sendbuf, particles, npart * sizeof(Particle) );max_f = 0.0;for (pipe=0; pipe<size; pipe++) { if (pipe != size-1) {

/* Initialize send to the “right” neighbor, while receiving from the “left” */MPI_Isend( sendbuf, npart, particletype, right, pipe, commring, &request[0] );MPI_Irecv( recvbuf, npart, particletype, left, pipe, commring, &request[1] );

} /* Compute forces */ max_f_seg = ComputeForces( particles, sendbuf, pv, npart ); if (max_f_seg > max_f) max_f = max_f_seg;

/* Wait for updates to complete and copy received particles to the send buffer */ if (pipe != size-1) MPI_Waitall( 2, request, statuses ); memcpy( sendbuf, recvbuf, counts[pipe] * sizeof(Particle) );}/* Compute the changes in position using the already calculated forces */sim_t += ComputeNewPos( particles, pv, npart, max_f, commring );

/* We could do graphics here (move particles on the display) */ } time = MPI_Wtime() - time; if (rank == 0) {

printf( "Computed %d particles in %f seconds\n", totpart, time ); } MPI_Finalize(); return 0;}

Page 66: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

66

N-Body (source code)/* Initialize particle positions, masses and forces */void InitParticles( Particle particles[], ParticleV pv[], int npart ){ int i; for (i=0; i<npart; i++) {

particles[i].x = drand48();particles[i].y = drand48();particles[i].z = drand48();particles[i].mass = 1.0;pv[i].xold = particles[i].x;pv[i].yold = particles[i].y;pv[i].zold = particles[i].z;pv[i].fx = 0;pv[i].fy = 0;pv[i].fz = 0;

}}/* Compute forces (2-D only) */double ComputeForces( Particle myparticles[], Particle others[], ParticleV pv[], int npart ){ double max_f, rmin; int i, j;

max_f = 0.0; for (i=0; i<npart; i++) { double xi, yi, mi, rx, ry, mj, r, fx, fy; rmin = 100.0; xi = myparticles[i].x; yi = myparticles[i].y; fx = 0.0; fy = 0.0;

Page 67: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

67

N-Body (source code)for (j=0; j<npart; j++) { rx = xi - others[j].x; ry = yi - others[j].y; mj = others[j].mass; r = rx * rx + ry * ry; /* ignore overlap and same particle */ if (r == 0.0) continue; if (r < rmin) rmin = r; /* compute forces */ r = r * sqrt(r); fx -= mj * rx / r; fy -= mj * ry / r; } pv[i].fx += fx; pv[i].fy += fy; /* Compute a rough estimate of (1/m)|df / dx| */ fx = sqrt(fx*fx + fy*fy)/rmin; if (fx > max_f) max_f = fx; } return max_f;}

/* Update particle positions (2-D only) */double ComputeNewPos( Particle particles[], ParticleV pv[], int npart, double max_f, MPI_Comm commring ){ int i; double a0, a1, a2; static double dt_old = 0.001, dt = 0.001; double dt_est, new_dt, dt_new;

Page 68: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

68

N-Body (source code)/* integation is a0 * x^+ + a1 * x + a2 * x^- = f / m */ a0 = 2.0 / (dt * (dt + dt_old)); a2 = 2.0 / (dt_old * (dt + dt_old)); a1 = -(a0 + a2); /* also -2/(dt*dt_old) */ for (i=0; i<npart; i++) { double xi, yi; /* Very, very simple leapfrog time integration. We use a variable step version to simplify time-step control. */ xi = particles[i].x; yi = particles[i].y; particles[i].x = (pv[i].fx - a1 * xi - a2 * pv[i].xold) / a0; particles[i].y = (pv[i].fy - a1 * yi - a2 * pv[i].yold) / a0; pv[i].xold = xi; pv[i].yold = yi; pv[i].fx = 0; pv[i].fy = 0; } /* Recompute a time step. Stability criteria is roughly 2/sqrt(1/m |df/dx|) >= dt. We leave a little room */ dt_est = 1.0/sqrt(max_f); if (dt_est < 1.0e-6) dt_est = 1.0e-6; MPI_Allreduce( &dt_est, &dt_new, 1, MPI_DOUBLE, MPI_MIN, commring ); /* Modify time step */ if (dt_new < dt) { dt_old = dt; dt = dt_new; } else if (dt_new > 4.0 * dt) { dt_old = dt; dt *= 2.0; } return dt_old;}

Page 69: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

69

Demo : N-Body Problem

> mpirun -np 4 nbodypipe 4000
Computed 4000 particles in 1.119051 seconds

Page 70: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

70

Topics

• Introduction

• Midterm Exam Review

• Matrix Multiplication

• N-Body Problem

• Fast Fourier Transform (FFT)
• Summary – Materials for Test

Page 71: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Serial FFT

• Let i = sqrt(-1) and index matrices and vectors from 0.
• The Discrete Fourier Transform of an m-element vector v is F*v, where F is the m×m matrix defined as F[j,k] = ω^(j*k)
• Here ω = e^(2πi/m) = cos(2π/m) + i*sin(2π/m)
• This is a complex number whose mth power is 1 and is therefore called the mth root of unity
• E.g., for m = 4: ω = 0+1*i, ω^2 = -1+0*i, ω^3 = 0-1*i, ω^4 = 1+0*i

Source: www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_24_1999-new.ppt
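For reference, a direct O(m^2) evaluation of F*v might look like the sketch below (C99 complex arithmetic; the driver vector is an arbitrary example, and this is not the course-supplied kernel):

#include <stdio.h>
#include <complex.h>
#include <math.h>

#define PI 3.14159265358979

/* Naive O(m^2) DFT: (F*v)[j] = sum_k w^(j*k) * v[k], with w = e^(2*pi*i/m),
   following the definition and sign convention on the slide. */
void dft(const double complex *v, double complex *out, int m)
{
    int j, k;
    for (j = 0; j < m; j++) {
        double complex sum = 0.0;
        for (k = 0; k < m; k++) {
            double ang = 2.0 * PI * (double)j * (double)k / (double)m;
            sum += (cos(ang) + I * sin(ang)) * v[k];    /* w^(j*k) * v[k] */
        }
        out[j] = sum;
    }
}

int main(void)
{
    double complex v[4] = { 1.0, 2.0, 3.0, 4.0 }, V[4];
    int j;
    dft(v, V, 4);
    for (j = 0; j < 4; j++)
        printf("V[%d] = %6.3f %+6.3fi\n", j, creal(V[j]), cimag(V[j]));
    return 0;
}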

Page 72: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Related Transforms

• Most applications require multiplication by both F and inverse(F).

• Multiplying by F and inverse(F) are essentially the same. (inverse(F) is the complex conjugate of F divided by n.)

• For solving the Poisson equation and various other applications, we use variations on the FFT– The sin transform -- imaginary part of F

– The cos transform -- real part of F

• Algorithms are similar, so we will focus on the forward FFT.

Source: www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_24_1999-new.ppt

Page 73: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Serial Algorithm for the FFT

• Compute the FFT of an m-element vector v, F*v

  (F*v)[j] = Σ_{k=0}^{m-1} F(j,k) * v(k)
           = Σ_{k=0}^{m-1} ω^(j*k) * v(k)
           = Σ_{k=0}^{m-1} (ω^j)^k * v(k)
           = V(ω^j)

• Where V is defined as the polynomial

  V(x) = Σ_{k=0}^{m-1} x^k * v(k)

Source: www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_24_1999-new.ppt

Page 74: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Divide and Conquer FFT

• V can be evaluated using divide-and-conquer:

  V(x) = Σ_{k=0}^{m-1} x^k * v(k)
       = v[0] + x^2*v[2] + x^4*v[4] + …
         + x*(v[1] + x^2*v[3] + x^4*v[5] + … )
       = Veven(x^2) + x*Vodd(x^2)

• V has degree m-1, so Veven and Vodd are polynomials of degree m/2-1

• We evaluate these at the points (ω^j)^2 for 0 <= j <= m-1

• But this is really just m/2 different points, since

  (ω^(j+m/2))^2 = (ω^j * ω^(m/2))^2 = ω^(2j) * ω^m = (ω^j)^2

Source: www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_24_1999-new.ppt

Page 75: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Divide-and-Conquer FFT

FFT(v, ω, m)

  if m = 1 return v[0]

  else

    veven = FFT(v[0:2:m-2], ω^2, m/2)

    vodd  = FFT(v[1:2:m-1], ω^2, m/2)

    ω-vec = [ω^0, ω^1, … ω^(m/2-1)]      (precomputed)

    return [veven + (ω-vec .* vodd),
            veven - (ω-vec .* vodd)]

• The .* above is component-wise multiplication.
• The [… , …] is constructing an m-element vector from two m/2-element vectors.

This results in an O(m log m) algorithm.

Source: www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_24_1999-new.ppt
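A compact recursive realization of this pseudocode, offered as one possible sketch in C99 (the naming, layout, and test vector are mine, not from the referenced slides):

#include <stdio.h>
#include <complex.h>
#include <math.h>

#define PI 3.14159265358979

/* Recursive divide-and-conquer FFT matching the pseudocode above.
   m must be a power of two; this is illustrative, not an optimized kernel. */
static void fft_rec(const double complex *v, double complex *out, int m, int stride)
{
    int j;
    if (m == 1) { out[0] = v[0]; return; }
    fft_rec(v,          out,         m / 2, 2 * stride);   /* veven = FFT of even-indexed terms */
    fft_rec(v + stride, out + m / 2, m / 2, 2 * stride);   /* vodd  = FFT of odd-indexed terms  */
    for (j = 0; j < m / 2; j++) {
        double ang = 2.0 * PI * (double)j / (double)m;
        double complex w = cos(ang) + I * sin(ang);        /* w^j, same sign convention as the slides */
        double complex e = out[j], o = w * out[j + m / 2];
        out[j]         = e + o;                            /* veven + (w-vec .* vodd) */
        out[j + m / 2] = e - o;                            /* veven - (w-vec .* vodd) */
    }
}

int main(void)
{
    double complex v[8] = { 1, 1, 1, 1, 0, 0, 0, 0 }, V[8];
    int j;
    fft_rec(v, V, 8, 1);
    for (j = 0; j < 8; j++)
        printf("V[%d] = %6.3f %+6.3fi\n", j, creal(V[j]), cimag(V[j]));
    return 0;
}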

Page 76: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

1D FFT: Butterfly Pattern

Source: www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_24_1999-new.ppt

Page 77: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Higher Dimension FFTs

• FFTs on 2 or 3 dimensions are defined as 1D FFTs on vectors in all dimensions.

• E.g., a 2D FFT does 1D FFTs on all rows and then all columns
• There are 3 obvious possibilities for the 2D FFT:

– (1) 2D blocked layout for matrix, using 1D algorithms for each row and column

– (2) Block row layout for matrix, using serial 1D FFTs on rows, followed by a transpose, then more serial 1D FFTs

– (3) Block row layout for matrix, using serial 1D FFTs on rows, followed by parallel 1D FFTs on columns

– Option 1 is best

• For a 3D FFT the options are similar
  – 2 phases done with serial FFTs, followed by a transpose for the 3rd

– can overlap communication with 2nd phase in practice

Source: www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_24_1999-new.ppt

Page 78: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

78

2-D FFT Flowchart

Initialize matrix

Distribute matrix by rows (MPI_Scatter)

Reorganize matrix slice into square chunks

Finalize MPI

Initialize MPI environment

Compute 1-D FFTs on rows

Redistribute matrix chunks (MPI_Alltoall)

Transpose matrix chunks

Compute 1-D FFTs on rows

Collect matrix slices (MPI_Gather)

Transpose assembled matrix

Receive matrix slice (MPI_Scatter)

Reorganize matrix slice into square chunks

Finalize MPI

Initialize MPI environment

Compute 1-D FFTs on rows

Redistribute matrix chunks (MPI_Alltoall)

Transpose matrix chunks

Compute 1-D FFTs on rows

Send matrix slice (MPI_Gather)

MASTER

WORKER

Page 79: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

Fast Fourier Transform

79

MPI - Two-Dimensional Fast Fourier Transform - C Version

• The image originates on a single processor (SOURCE_PROCESSOR).
• This image, a[], is distributed by rows to all other processors.
• Each processor then performs a one-dimensional FFT on the rows of the image stored locally.
• The image is then transposed using the MPI_Alltoall() routine; this partitions the intermediate image by columns.
• Each processor then performs a one-dimensional FFT on the columns of the image.
• Finally, the columns of the image are collected back at the destination processor and the output image is tested for correctness.
• Input is a 512x512 complex matrix. The input matrix is initialized with a point source.
• Output is a 512x512 complex matrix that overwrites the input matrix.
• Timing and Mflop results are displayed following execution.
• A straightforward, unsophisticated 1D FFT kernel is used. It is sufficient to convey the general idea, but be aware that there are better 1D FFTs available on many systems.

Page 80: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

2D FFT – Code Walkthrough … 1

80

#include <stdio.h>
#include <stdlib.h>
#include <sys/utsname.h>
#include <math.h>
#include <sys/time.h>
#include <time.h>
#include <mpi.h>
#include "mpi_2dfft.h"

#define IMAGE_SIZE 512
#define NUM_CELLS 4
#define IMAGE_SLICE (IMAGE_SIZE / NUM_CELLS)
#define SOURCE_PROCESSOR 0
#define DEST_PROCESSOR SOURCE_PROCESSOR

int numtasks;                              /* Number of processors */
int taskid;                                /* ID number for each processor */
mycomplex a[IMAGE_SIZE][IMAGE_SIZE];       /* input matrix: complex numbers */
mycomplex a_slice[IMAGE_SLICE][IMAGE_SIZE];
mycomplex a_chunks[NUM_CELLS][IMAGE_SLICE][IMAGE_SLICE];
mycomplex b[IMAGE_SIZE][IMAGE_SIZE];       /* intermediate matrix */
mycomplex b_slice[IMAGE_SIZE][IMAGE_SLICE];
mycomplex b_chunks[NUM_CELLS][IMAGE_SLICE][IMAGE_SLICE];
mycomplex *collect;
mycomplex w_common[IMAGE_SIZE/2];          /* twiddle factors */
struct timeval etime[10];
int checkpoint;
float dt[10], sum;

Page 81: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

2D FFT – Code Walkthrough … 2

81

int main(argc,argv)int argc;char *argv[];{ int rc, cell, i, j, n, nx, logn, errors, sign, flops; float mflops; checkpoint=0;

/** Initialize MPI environment and get task's ID and number of tasks in the partition. **/ rc = MPI_Init(&argc,&argv); rc|= MPI_Comm_size(MPI_COMM_WORLD,&numtasks); rc|= MPI_Comm_rank(MPI_COMM_WORLD,&taskid); /* Must have 4 tasks for this program */ /** Checking if numtasks is a power of 2 (in this case we have set it to 4) **/ if (numtasks != NUM_CELLS) { printf("Error: this program requires %d MPI tasks\n", NUM_CELLS); exit(1); } if (rc != MPI_SUCCESS) printf ("error initializing MPI and obtaining task ID information\n"); else printf ("MPI task ID = %d\n", taskid);

n = IMAGE_SIZE; /* compute logn and ensure that n is a power of two */ nx = n; logn = 0;

Page 82: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

/** Checking if IMAGE_SIZE is a power of 2 **/ while(( nx >>= 1) > 0) logn++; nx = 1; for (i=0; i<logn; i++) nx = nx*2; if (nx != n) { (void)fprintf(stderr, "%d: fft size must be a power of 2\n", IMAGE_SIZE); exit(0); }

/** Initialize real and imaginary parts of array (??) **/ if (taskid == SOURCE_PROCESSOR) { for (i=0; i<n; i++) for (j=0; j<n; j++) a[i][j].r = a[i][j].i = 0.0; a[n/2][n/2].r = a[n/2][n/2].i = (float)n; /* real and imaginary array[256][256] are initialized to 512.0 and rest to 0.0 */ /* print table headings in anticipation of timing results */ printf("512 x 512 2D FFT\n"); printf(" Timings(secs)\n"); printf(" scatter 1D-FFT-row transpose 1D-FFT-col gather"); printf(" total\n"); } /* precompute the complex constants (twiddle factors) for the 1D FFTs */ for (i=0;i<n/2;i++) { w_common[i].r = (float) cos((double)((2.0*PI*i)/(float)n)); w_common[i].i = (float) -sin((double)((2.0*PI*i)/(float)n)); }

2D FFT – Code Walkthrough … 3

82

Page 83: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

/* Distribute Input Matrix By Rows */ rc = MPI_Barrier(MPI_COMM_WORLD); if (rc != MPI_SUCCESS) { printf("Error: MPI_Barrier() failed with return code %d\n", rc); return(-1); } gettimeofday(&etime[checkpoint++], (struct timeval*)0);

/* IMAGE_SLICE = dimension of slice of image per process Each slice of image is delivered to corresponding process using MPI_Scatter() */

rc = MPI_Scatter((char *) a, IMAGE_SLICE * IMAGE_SIZE * 2, MPI_FLOAT, (char *) a_slice, IMAGE_SLICE * IMAGE_SIZE * 2, MPI_FLOAT, SOURCE_PROCESSOR, MPI_COMM_WORLD);

if (rc != MPI_SUCCESS) { printf("Error: MPI_Scatter() failed with return code %d\n", rc); return(-1); } gettimeofday(&etime[checkpoint++], (struct timeval*)0);

/* Perform 1-D Row FFTs *//* a_slice[ ][ ] is the buffer containing each individual image chunk. For each row in image slice this section of code computes 1D FFT */

for (i=0;i<IMAGE_SLICE;i++) fft(&a_slice[i][0], w_common, n, logn);

gettimeofday(&etime[checkpoint++], (struct timeval*)0);

2D FFT – Code Walkthrough … 4

83

Page 84: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

2D FFT – Code Walkthrough … 5

84

/* Transpose 2-D image */

for(cell=0;cell<NUM_CELLS;cell++) { for(i=0;i<IMAGE_SLICE;i++) { for(j=0;j<IMAGE_SLICE;j++) { a_chunks[cell][i][j].r = a_slice[i][j + (IMAGE_SLICE * cell)].r; a_chunks[cell][i][j].i = a_slice[i][j + (IMAGE_SLICE * cell)].i; } } }/* IMAGE_SLICE * IMAGE_SLICE * 2 (because we have real and imaginary); Each component chunk is delivered to corresponding process using MPI_Alltoall() */ rc = MPI_Alltoall(a_chunks, IMAGE_SLICE * IMAGE_SLICE * 2, MPI_FLOAT, b_slice, IMAGE_SLICE * IMAGE_SLICE * 2, MPI_FLOAT, MPI_COMM_WORLD); if (rc != MPI_SUCCESS) { printf("Error: MPI_Alltoall() failed in cell %d return code %d\n", taskid, rc); return(-1); } gettimeofday(&etime[checkpoint++], (struct timeval*)0);

Page 85: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

2D FFT – Code Walkthrough … 6

85

for(i=0;i<IMAGE_SLICE;i++) { for(j=0;j<IMAGE_SIZE;j++) { a_slice[i][j].r = b_slice[j][i].r; a_slice[i][j].i = b_slice[j][i].i; } }

/* Perform 1-D FFTs (effectively on columns) */for (i=0;i<IMAGE_SLICE;i++) fft(&a_slice[i][0], w_common, IMAGE_SIZE, logn);gettimeofday(&etime[checkpoint++], (struct timeval*)0);

/* Undistribute Output Matrix by Rows */collect = ( mycomplex *) malloc(IMAGE_SIZE * IMAGE_SIZE * sizeof( mycomplex));

/* Every process executes MPI_Gather() */rc = MPI_Gather(a_slice, IMAGE_SLICE * IMAGE_SIZE * 2, MPI_FLOAT, a, IMAGE_SLICE * IMAGE_SIZE * 2,

MPI_FLOAT, DEST_PROCESSOR, MPI_COMM_WORLD);if (rc != MPI_SUCCESS){ printf("Error: MPI_Gather() failed with return code %d\n", rc); fflush(stdout);}

Page 86: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

2D FFT – Code Walkthrough … 7

86

/* If destination processor then perform another transpose of a[ ][ ] into b[ ][ ]*/ if (taskid == DEST_PROCESSOR) { for(i=0;i<IMAGE_SIZE;i++) { for(j=0;j<IMAGE_SIZE;j++) { b[i][j].r = a[j][i].r; b[i][j].i = a[j][i].i; } } }

gettimeofday(&etime[checkpoint++], (struct timeval*)0); fflush(stdout);

/* Calculate event timings and flops - then print them */ for(i=1;i<checkpoint;i++) dt[i] = ((float) ((etime[i].tv_sec - etime[i-1].tv_sec) * 1000000 + etime[i].tv_usec - etime[i-1].tv_usec)) / 1000000.0; printf("cell %d: ", taskid); for(i=1;i<checkpoint;i++) printf("%2.6f ", dt[i]); sum=0; for(i=1;i<checkpoint;i++) sum+=dt[i]; printf(" %2.6f \n", sum);

Page 87: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

2D FFT – Code Walkthrough … 8

87

if (taskid == DEST_PROCESSOR) { flops = (n*n*logn)*10; mflops = ((float)flops/1000000.0); mflops = mflops/(float)sum; printf("Total Mflops= %3.4f\n", mflops); errors = 0; for (i=0;i<n;i++) { if (((i+1)/2)*2 == i) sign = 1; else sign = -1; for (j=0;j<n;j++) { if (b[i][j].r > n*sign+EPSILON || b[i][j].r < n*sign-EPSILON || b[i][j].i > n*sign+EPSILON || b[i][j].i < n*sign-EPSILON) { printf("[%d][%d] is %f,%f should be %f\n", i, j, b[i][j].r, b[i][j].i, (float) n*sign); errors++; } sign *= -1; } } if (errors) { printf("%d errors!!!!!\n", errors); exit(0); } } MPI_Finalize(); exit(0);}

Page 88: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

2D FFT – Code Walkthrough … 9

88

fft(data,w_common,n,logn)mycomplex *data,*w_common;int n,logn;{ int incrvec, i0, i1, i2, nx; float f0, f1; void bit_reverse();

/* bit-reverse the input vector */ (void)bit_reverse(data,n);

/* do the first logn-1 stages of the fft */ i2 = logn; for (incrvec=2;incrvec<n;incrvec<<=1) { i2--; for (i0 = 0; i0 < incrvec >> 1; i0++) { for (i1 = 0; i1 < n; i1 += incrvec) { f0 = data[i0+i1 + incrvec/2].r * w_common[i0<<i2].r - data[i0+i1 + incrvec/2].i * w_common[i0<<i2].i; f1 = data[i0+i1 + incrvec/2].r * w_common[i0<<i2].i + data[i0+i1 + incrvec/2].i * w_common[i0<<i2].r; data[i0+i1 + incrvec/2].r = data[i0+i1].r - f0; data[i0+i1 + incrvec/2].i = data[i0+i1].i - f1; data[i0+i1].r = data[i0+i1].r + f0; data[i0+i1].i = data[i0+i1].i + f1; } } }

Page 89: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

2D FFT – Code Walkthrough … 10

89

/* do the last stage of the fft */ for (i0 = 0; i0 < n/2; i0++) { f0 = data[i0 + n/2].r * w_common[i0].r - data[i0 + n/2].i * w_common[i0].i; f1 = data[i0 + n/2].r * w_common[i0].i + data[i0 + n/2].i * w_common[i0].r; data[i0 + n/2].r = data[i0].r - f0; data[i0 + n/2].i = data[i0].i - f1; data[i0].r = data[i0].r + f0; data[i0].i = data[i0].i + f1; }}/* bit_reverse - simple (but somewhat inefficient) bit reverse */void bit_reverse(a,n)mycomplex *a;int n;{ int i,j,k; j = 0; for (i=0; i<n-2; i++){ if (i < j) { SWAP(a[j],a[i]); } k = n>>1; while (k <= j) { j -= k; k >>= 1; } j += k; }}

Page 90: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

FFT Header File

90

/***************************************************************************
 * FILE: mpi_2dfft.h
 * DESCRIPTION: see mpi_2dfft.c
 * AUTHOR: George Gusciora
 * LAST REVISED:
 ***************************************************************************/
#define MAXN 2048            /* max 2d fft size */
#define EPSILON 0.00001      /* for comparing fp numbers */
#define PI 3.14159265358979  /* 4*atan(1.0) */

typedef struct {float r, i;} mycomplex;

/* swap a pair of complex numbers */
#define SWAP(a,b) {float swap_temp=(a).r;(a).r=(b).r;(b).r=swap_temp;\
                   swap_temp=(a).i;(a).i=(b).i;(b).i=swap_temp;}

/* swap a pair of floats */
#define MYSWAP(a,b) {float swap_temp=a;a=b;b=swap_temp;}

Page 91: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

91

Topics

• Introduction

• Midterm Exam Review

• Matrix Multiplication

• N-Body Problem

• Fast Fourier Transform (FFT)

• Summary – Materials for Test

Page 92: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

92

Summary – Material for the Test

• Introduction – Slides: 4, 5, 6
• Matrix Multiply basic algorithm – Slides: 49 – 54
• N-body –
• FFT –

Page 93: High Performance Computing: Concepts, Methods & Means Parallel Algorithms 2 Prof. Thomas Sterling Department of Computer Science Louisiana State University.

93