High Performance Computing: Concepts, Methods & Means
Parallel Algorithms 2
Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
March 6, 2007
Topics
• Introduction
• Midterm Exam Review
• Matrix Multiplication
• N-Body Problem
• Fast Fourier Transform (FFT)
• Summary – Materials for Test
Topics
• Introduction
• Midterm Exam Review
• Matrix Multiplication
• N-Body Problem
• Fast Fourier Transform (FFT)
• Summary – Materials for Test
Half-Way Through (almost)
• More of the same
• Today: some basic algorithms (in MPI)
– Matrix-matrix multiply
– N-body
– FFT
• But first: a brief walk-through in preparation for the Midterm exam! (Good Luck)
Topics
• Introduction
• Midterm Exam Review
• Matrix Multiplication
• N-Body Problem
• Fast Fourier Transform (FFT)
• Summary – Materials for Test
How to Prepare for Midterm
• Closed book exam
• Will look like a problem set (use as a template)
• Study aids:
– Summary slide at the end of each lecture
– Problem sets
• Will emphasize
– Basic knowledge
– Skills
– Performance models
• Note:
– To be held in room 338 Johnston Hall
– 1 hour 15 minutes
– Bring a calculator (know how to use it)
HPC in Overview (1st half)
• Supercomputing evolves as an interplay of
– Device technology
– Computer architecture
– Execution models and programming methods
• Three classes of parallel computing
– Capacity
– Cooperative
– Capability
• Three execution models
– Throughput
– Shared memory multithreaded
– Communicating sequential processes (message passing)
• Three programming formalisms
– Condor
– OpenMP
– MPI
• Performance modeling and measurement
– Metrics
– Models
– Measurement tools
S1 – L3 - Benchmarking
• Basic performance metrics (slide 4)
• Definition of benchmark in own words; purpose of benchmarking; properties of a good benchmark (slides 5, 6, 7)
• Linpack: what it is, what it measures, concepts and complexities (slides 15, 17, 18)
• HPL: algorithms and concepts (slides 21 through 24)
• Linpack compare and contrast (slide 25)
• General knowledge about HPCC and NPB suites (slides 31, 34, 35)
• Benchmark result interpretation (slides 49, 50)
S1 – L4 : Capacity Computing
• Understand material on slides (4,5), (7,8)
• Understand the example detailed in slides 17, 18
• Understand (19) and be able to derive (20,21), (22,23)
• Understand Condor concepts detailed in slides 30, 31, 32
• Condor commands (37-47): know what the basic commands are, what they do, and how to interpret the output presented by them (no need to memorize command-line options)
• Understand issues listed on slide 53
• Required reading materials:
– http://www.cct.lsu.edu/~cdekate/7600/beowulf-chapter-rev1.pdf
– Specific pages to focus on: 3-16
S2 – L1 : Architecture
• Need to know content on slides 11, 15, 22, 23, 33
• Understand how each of the technologies listed on slide 7 affects performance
• Understand concepts on slides 8, 9
• Understand concepts on slides 17, 18, 20
• Understand pipelining concepts and equations detailed in slides 27, 28
• Understand vector processing concepts and equations detailed in slides 29, 30
S2 – L2 : SMP
• Please make sure that you have addressed all points outlined on slide 5
• Understand content on slide 7
• Understand concepts, equations, problems on slides 11, 12, 13
• Understand content on slides 21, 24, 26, 29
• Understand concepts on slides 32, 33
• Understand content on slides 36, 55
• Required reading material :
http://arstechnica.com/articles/paedia/hardware/pcie.ars/1
S2 – L3 : PThreads
• Performance & cpi: slide 8
• Multithread concepts: 13, 16, 18, 19, 22, 24, 31
• Thread implementations: 35 – 37
• Pthreads: 43 – 45, 48, 55
S2 – L4 : OpenMP
• Components: 6
• Compiling: 9, 12
• Environment variables: 13, 14
• Top level: 15
• Shared data: 18, 19, 20
• Parallel flow control: 23, 24, 25
• Synchronization: 32, 34
• Performance: 39
• Synopsis: 44
S2 – L5 : Performance 2
• Measuring system operation slides: 11, 13, 17
• Gprof slides: 21, 22
• PerfSuite slides: 25, 29
• PAPI slides: 33 – 36 (inclusive)
• Tau slides: 56 – 60 (inclusive)
S3 – L1 : Communicating Sequential Processes
• Basics: slides 6 – 9, 16
• CSP: slide 19
• Unix: slides 24, 28 - 30
S3 – L2 : MPI
• MPI standard: slides 4, 7
• Compile and run an MPI program: slides 10, 11
• Environment functions: slides 12, 14
• Point-to-point functions: slides 27, 28
• Blocking vs. nonblocking: slides 25, 26
• Deadlock: slides 29-31
• Basic collective functions: slides 33, 34, 36, 38, 40, 41, 43
S3 – L3 : Performance 3
• Essential MPI – slide: 9
• Performance models – slides: 12, 15, 16, 18 (Hockney)
• LogP – slides: 20 – 23
• Effective bandwidth – slide: 30
• Tau/MPI – slides: 41, 43
S3 – L4 : Parallel Algorithms
• Introduction – slides: 4, 5, 6
• Array decomposition – slides: 11, 12
• Mandelbrot load balancing – slides: 25, 26
• Monte Carlo create communicators – slides: 40, 42
System level Overview
• Understand the 3 classes of parallel computing (capacity, cooperative, capability).
• Software System
– Understand the software stack (e.g. OS, compilers…) used in various supercomputers
– Conceptual understanding of different parallel programming models (e.g. shared memory, message passing…), advantages and disadvantages of each system.
• Computer Architecture
– Understand and be able to discuss different sources of performance degradation (latency, overhead, etc.)
– Understand Amdahl's Law and be able to solve problems related to the same, as well as scalability, efficiency, and cpi
– Understand and be able to describe the different forms of hardware parallelism (pipelining, ILP, multiprocessors (SIMD, MIMD), etc.)
– Understand numerical problems provided in section 1 Problem Sets (1,2,3) and the associated equations & theory behind them
Execution Models
• Throughput execution model (e.g. Condor)
– Be aware of the various Condor commands
– Thoroughly understand core Condor concepts (e.g. ClassAds and Matchmaking), and how these concepts work together
• Shared memory multithreaded (e.g. OpenMP)
– Understand sources of contention (race conditions) and how to resolve them (critical sections, etc.)
– Understand various OpenMP constructs and how they work (e.g. be able to answer questions like how and when to use the "critical" construct and its performance implications)
– Understand the concept of shared, private, and reduction variables.
– Be able to read and understand simple OpenMP code (C) and be able to make conceptual changes where and when asked.
Execution Models and Performance
• Communicating Sequential Processes
– Conceptual understanding of CSP
– Know the meaning and common usage of the various MPI constructs
– Understand fundamental concepts like deadlock and how to resolve them
– Be able to read a small code snippet and correct conceptual (NOT syntactical) problems. You do not need to memorize the syntax of MPI constructs.
• Performance & Benchmarking
– Be aware of the Top500 list and the benchmarks used
– Be aware of the different benchmarks and what each of them stresses (Linpack, HPL, different components of HPCC…)
– Be aware of the different performance tools discussed in class and what they measure.
– Understand and be able to solve problems related to the LogP model
Key Terms and Concepts
• Speedup : Relative reduction of execution time of a fixed size workload through parallel execution
• Efficiency : Ratio of the actual performance to the best possible performance.
Speedup = (execution time on one processor) / (execution time on N processors)

Efficiency = (execution time on one processor) / ((execution time on multiple processors) × (number of processors))
Ideal Speedup Example
W = 2^20 units of total work, divided into tasks w_1 … w_1024 of 2^10 operations each, so W = Σ_i w_i
P = 2^8 = 256 processors
T(1) = 2^20 steps
T(2^8) = 2^12 steps

Speedup = T(1) / T(2^8) = 2^20 / 2^12 = 2^8 = 256
Efficiency = T(1) / (2^8 × T(2^8)) = 2^20 / (2^8 × 2^12) = 1

Units: steps
Ideal Speedup Issues
• W is the total workload measured in elemental pieces of work (e.g. operations, instructions, etc.)
• T(p) is the total execution time measured in elemental time steps (e.g. clock cycles), where p is the # of execution sites (e.g. processors, threads)
• w_i is the work for a given task i
• Example: here we divide a million (really Mega, i.e. 2^20) operation workload, W, into a thousand (1024) tasks, w_1 to w_1024, each of 1 K operations
• Assume 256 processors performing the workload in parallel
• T(256) = 4096 steps, speedup = 256, Eff = 1
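The speedup and efficiency arithmetic above can be checked with a small C snippet. This is an illustrative sketch (the function names are mine, not from the lecture):

```c
#include <assert.h>

/* Speedup: execution time on one processor over time on p processors */
double speedup(double t1, double tp) { return t1 / tp; }

/* Efficiency: speedup normalized by the processor count */
double efficiency(double t1, double tp, double p) { return t1 / (p * tp); }

/* Example from the slide: T(1) = 2^20 steps, T(256) = 2^12 steps */
void check_ideal_example(void) {
    double t1 = 1048576.0;   /* 2^20 */
    double tp = 4096.0;      /* 2^12 */
    assert(speedup(t1, tp) == 256.0);          /* 2^8 */
    assert(efficiency(t1, tp, 256.0) == 1.0);  /* ideal case */
}
```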
Amdahl’s Law
S = T_O / T_F
f = T_A / T_O
T_F = (1 - f) T_O + (f T_O) / g
S = 1 / ((1 - f) + f / g)

where
T_O = time for non-accelerated computation
T_F = time for accelerated computation
T_A = time of the portion of computation that can be accelerated
g = peak performance gain for the accelerated portion of computation
f = fraction of the non-accelerated computation to be accelerated
S = speed up of the computation with acceleration applied

(Figure: two time lines from start to end — the non-accelerated run of length T_O containing the accelerable portion T_A, and the accelerated run of length T_F in which that portion takes T_A / g.)
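Amdahl's law can be evaluated directly as a sanity check. A minimal sketch (the helper name is mine, not from the slides):

```c
#include <assert.h>
#include <math.h>

/* Amdahl's law: S = 1 / ((1 - f) + f/g), where f is the fraction of the
   computation that is accelerated and g is the peak gain on that fraction. */
double amdahl_speedup(double f, double g) {
    return 1.0 / ((1.0 - f) + f / g);
}
```

With f = 0 nothing is accelerated and S = 1; with f = 1 the whole computation is accelerated and S = g; even as g grows without bound, S is capped at 1/(1 - f).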
Overhead
(Figure: the workload divided among P = 4 workers; each task consists of overhead v plus work w, so W = 4v + 4w.)

W = Σ_i w_i
T_1 = W + v
T_P = W/P + v
S = T_1 / T_P = (W + v) / (W/P + v) ≈ P / (1 + P·v/W)

v = overhead
w = work unit
W = total work
T_i = execution time with i processors
P = # processors

Assumption: workload is infinitely divisible
Scalability & Overhead
J = # tasks = W / w_g
T_1 = W + v ≈ W (when W >> v)
T_P = J (w_g + v) / P
S = T_1 / T_P ≈ (W P) / (J (w_g + v)) = P / (1 + v / w_g)

v = overhead
w_g = work unit (task size)
W = total work
T_i = execution time with i processors
P = # processors
J = # tasks
Scalability and Overhead for fixed-size work tasks

• W is divided into J tasks of size w_g
• Each task requires v overhead work to manage
• For P processors there are approximately J/P tasks to be performed in sequence, so:
• T_P = J (w_g + v) / P
• Note that S = T_1 / T_P
• So, S = P / (1 + v / w_g)
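The fixed-size-task result reduces to one line of arithmetic. A sketch (the function name is mine):

```c
#include <assert.h>
#include <math.h>

/* Speedup with per-task overhead v and task size wg: S = P / (1 + v/wg).
   Overhead that is small relative to the task size costs little speedup. */
double overhead_speedup(double P, double v, double wg) {
    return P / (1.0 + v / wg);
}
```

For example, overhead_speedup(256, 0, 1024) gives the ideal 256, while with v == wg (overhead as large as the work itself) the speedup halves.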
Measuring LogP Parameters
• Finding L+2*o
– Proc 0: (MPI_Send() then MPI_Recv()) x N
– Proc 1: (MPI_Recv() then MPI_Send()) x N
– L+2*o = total time/N
Figure 1: Time diagram for benchmark 1. (a) time diagram of processor 0; (b) time diagram of processor 1.
Measuring LogP Parameters
• Finding o
– Proc 0: (MPI_Send() then some_work then MPI_Recv() ) x N
– Proc 1: (MPI_Recv() then MPI_Send() then some_work) x N
– o = (1/2)total time/N – time(some_work)
– requires time(some_work) > 2*L+2*o
Figure 2: Time diagram for benchmark 2 with X > 2*L + Or + Os. (a) time diagram of processor 1; (b) time diagram of processor 2.
Performance Metrics
• Peak floating point operations per second (flops)
• Peak instructions per second (ips)
• Sustained throughput
– flops, Mflops, Gflops, Tflops, Pflops
– flops, Megaflops, Gigaflops, Teraflops, Petaflops
– ips, Mips, …
• Cycles per instruction
– cpi
– Alternatively: instructions per cycle, ipc
• Memory access latency
– cycles or seconds
• Memory access bandwidth
– bytes per second
– or Gigabytes per second, GBps, GB/s
• Bi-section bandwidth
– bytes per second
CPI (continued)
T = #I × cpi × t_cycle
cpi = r_R × cpi_R + r_M × cpi_M
cpi_M = m_hit × cpi_M-hit + m_miss × cpi_M-miss

where
r_R = #I_R / #I, r_M = #I_M / #I, r_R + r_M = 1.0
m_hit = #I_M-hit / #I_M, m_miss = #I_M-miss / #I_M

so
T = #I × (r_R × cpi_R + r_M × (m_hit × cpi_M-hit + m_miss × cpi_M-miss)) × t_cycle
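The CPI decomposition can be evaluated numerically. The snippet below is an illustrative sketch of that model (parameter names are my own, not the lecture's notation):

```c
#include <assert.h>
#include <math.h>

/* Effective cpi: register-only instructions (rate rR, cost cpiR) plus
   memory instructions (rate rM) split into hits and misses.
   Assumes rR + rM = 1 and m_hit + m_miss = 1. */
double effective_cpi(double rR, double cpiR, double rM,
                     double m_miss, double cpi_hit, double cpi_miss) {
    double m_hit = 1.0 - m_miss;
    double cpiM = m_hit * cpi_hit + m_miss * cpi_miss;
    return rR * cpiR + rM * cpiM;
}

/* Total execution time: T = #I * cpi * t_cycle */
double exec_time(double icount, double cpi, double t_cycle) {
    return icount * cpi * t_cycle;
}
```

With 30% memory instructions, a 10% miss rate, a 2-cycle hit, and a 102-cycle miss, the memory cpi is 0.9·2 + 0.1·102 = 12, so the effective cpi is 0.7·1 + 0.3·12 = 4.3.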
Basic Performance Metrics
• Time related:
– Execution time [seconds]
• wall clock time
• system and user time
– Latency
– Response time
• Rate related:
– Rate of computation
• floating point operations per second [flops]
• integer operations per second [ops]
– Data transfer (I/O) rate [bytes/second]
• Effectiveness:
– Efficiency [%]
– Memory consumption [bytes]
– Productivity [utility/($*second)]
• Modifiers:
– Sustained
– Peak
– Theoretical peak
Basic Parallel (MPI) Program Steps
• Establish logical bindings
• Initialize application execution environment
• Distribute data and work
• Perform core computations in parallel (across nodes)
• Synchronize and exchange intermediate data results
– Optional for non-embarrassingly parallel (cooperative)
• Detect “stop” condition
– Maybe implicit with a barrier etc.
• Aggregate final results
– Often a reduction operator
• Output results and error code
• Terminate and return to OS
The Essential MPI
• API elements:
– MPI_Init(), MPI_Finalize()
– MPI_Comm_size(), MPI_Comm_rank()
– MPI_COMM_WORLD
– Error checking using MPI_SUCCESS
– MPI basic data types (slide 27)
– Blocking: MPI_Send(), MPI_Recv()
– Non-blocking: MPI_Isend(), MPI_Irecv(), MPI_Wait()
– Collective calls: MPI_Barrier(), MPI_Bcast(), MPI_Gather(), MPI_Scatter(), MPI_Reduce()
• Commands:
– Running MPI programs: mpirun
– Compile: mpicc
– Compile: mpif77
Topics
• Introduction
• Midterm Exam Review
• Matrix Multiplication
• N-Body Problem
• Fast Fourier Transform (FFT)
• Summary – Materials for Test
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.
Matrices — A Review
An n x m matrix
Matrix Multiplication
Multiplication of two matrices, A and B, produces the matrix C whose elements, c_i,j (0 <= i < n, 0 <= j < m), are computed as follows:

c_i,j = Σ_{k=0}^{l-1} a_i,k × b_k,j

where A is an n x l matrix and B is an l x m matrix.
Matrix multiplication, C = A x B
Implementing Matrix Multiplication
Sequential Code
Assume throughout that the matrices are square (n x n matrices).The sequential code to compute A x B could simply be
for (i = 0; i < n; i++)
   for (j = 0; j < n; j++) {
      c[i][j] = 0;
      for (k = 0; k < n; k++)
         c[i][j] = c[i][j] + a[i][k] * b[k][j];
   }

This algorithm requires n^3 multiplications and n^3 additions, leading to a sequential time complexity of O(n^3). Very easy to parallelize.
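The triple loop above can be written as a standalone C function on flattened row-major arrays. This is an illustrative sketch, not the lecture's MPI code:

```c
#include <assert.h>

/* Naive n x n matrix multiply on row-major flattened arrays:
   n^3 multiplications and n^3 additions, i.e. O(n^3) sequential time. */
void matmul(int n, const double *a, const double *b, double *c) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] = sum;
        }
}
```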
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers 2nd ed., by B. Wilkinson & M. Allen,@ 2004 Pearson Education Inc. All rights reserved.
41
Block Matrix Multiplication
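The original slide's figure is not reproduced here. As a sketch of the idea (assuming a block size bs that divides n; the names are mine), the same product can be computed block by block, so that each bs x bs tile of A and B is reused while it is in cache:

```c
/* Blocked (tiled) n x n matrix multiply on row-major flattened arrays.
   Each bs x bs block of C accumulates products of matching blocks of A and B. */
void matmul_blocked(int n, int bs, const double *a, const double *b, double *c) {
    for (int i = 0; i < n * n; i++) c[i] = 0.0;
    for (int ib = 0; ib < n; ib += bs)
        for (int kb = 0; kb < n; kb += bs)
            for (int jb = 0; jb < n; jb += bs)
                /* multiply block (ib,kb) of A by block (kb,jb) of B */
                for (int i = ib; i < ib + bs; i++)
                    for (int k = kb; k < kb + bs; k++)
                        for (int j = jb; j < jb + bs; j++)
                            c[i*n + j] += a[i*n + k] * b[k*n + j];
}
```

The result is identical to the naive algorithm; only the loop order and memory traffic change.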
Performance Improvement
Using tree construction, n numbers can be added in log n steps using n processors:

Computational time complexity of O(log n) using n^3 processors.
Flowchart for Matrix Multiplication

“master”:
1. Initialize MPI environment
2. Initialize array
3. Partition array into workloads
4. Send workload to “workers”
5. Wait for “workers” to finish task
6. Recv. results
7. Print results
8. End

“workers” (each):
1. Initialize MPI environment
2. Recv. work
3. Calculate matrix product
4. Send result
5. End
Matrix Multiplication (source code)

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define NRA 62          /* number of rows in matrix A */
#define NCA 15          /* number of columns in matrix A */
#define NCB 7           /* number of columns in matrix B */
#define MASTER 0        /* taskid of first task */
#define FROM_MASTER 1   /* setting a message type */
#define FROM_WORKER 2   /* setting a message type */

int main(int argc, char *argv[])
{
   int numtasks,             /* number of tasks in partition */
       taskid,               /* a task identifier */
       numworkers,           /* number of worker tasks */
       source,               /* task id of message source */
       dest,                 /* task id of message destination */
       mtype,                /* message type */
       rows,                 /* rows of matrix A sent to each worker */
       averow, extra, offset,/* used to determine rows sent to each worker */
       i, j, k, rc;          /* misc */
   double a[NRA][NCA],       /* matrix A to be multiplied */
          b[NCA][NCB],       /* matrix B to be multiplied */
          c[NRA][NCB];       /* result matrix C */
   MPI_Status status;
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
46
Matrix Multiplication (source code)

   /* Initialize MPI Environment */
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
   MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
   if (numtasks < 2) {
      printf("Need at least two MPI tasks. Quitting...\n");
      MPI_Abort(MPI_COMM_WORLD, rc);
      exit(1);
   }
   numworkers = numtasks - 1;

   /* Master block */
   if (taskid == MASTER) {
      printf("mpi_mm has started with %d tasks.\n", numtasks);
      printf("Initializing arrays...\n");
      for (i = 0; i < NRA; i++)
         for (j = 0; j < NCA; j++)
            a[i][j] = i + j;      /* Initialize array a */
      for (i = 0; i < NCA; i++)
         for (j = 0; j < NCB; j++)
            b[i][j] = i * j;      /* Initialize array b */
      /* Send matrix data to the worker tasks */
      averow = NRA / numworkers; /* fraction of the array processed by each “worker” */
      extra = NRA % numworkers;
      offset = 0;
      mtype = FROM_MASTER;       /* Message tag */
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
Matrix Multiplication (source code)

      for (dest = 1; dest <= numworkers; dest++) {
         /* To each worker send: start point, number of rows to process,
            and sub-arrays to process */
         rows = (dest <= extra) ? averow + 1 : averow;
         printf("Sending %d rows to task %d offset=%d\n", rows, dest, offset);
         MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
         MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
         MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
         MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
         offset = offset + rows;
      }

      /* Receive results from worker tasks */
      mtype = FROM_WORKER;  /* Message tag for messages sent by “workers” */
      for (i = 1; i <= numworkers; i++) {
         source = i;
         /* offset stores the (processing) starting point of the work chunk */
         MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
         MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
         /* The array C contains the product of sub-array A and the array B */
         MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
         printf("Received results from task %d\n", source);
      }
      printf("******************************************************\n");
      printf("Result Matrix:\n");
      for (i = 0; i < NRA; i++) {
         printf("\n");
         for (j = 0; j < NCB; j++)
            printf("%6.2f ", c[i][j]);
      }
      printf("\n******************************************************\n");
      printf("Done.\n");
   }
Matrix Multiplication (source code)
   /**************************** worker task ************************************/
   if (taskid > MASTER) {
      mtype = FROM_MASTER;
      MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

      for (k = 0; k < NCB; k++)
         for (i = 0; i < rows; i++) {
            c[i][k] = 0.0;
            for (j = 0; j < NCA; j++)
               /* Calculate the product and store the result in C */
               c[i][k] = c[i][k] + a[i][j] * b[j][k];
         }
      mtype = FROM_WORKER;
      MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
      MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
      /* Worker sends the resultant array to the master */
      MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
   }
   MPI_Finalize();
}
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
Demo : Matrix Multiplication

[cdekate@compute-0-6 matrix_multiplication]$ mpirun -np 4 ./mpi_mm
mpi_mm has started with 4 tasks.
Initializing arrays...
Sending 21 rows to task 1 offset=0
Sending 21 rows to task 2 offset=21
Sending 20 rows to task 3 offset=42
Received results from task 1
Received results from task 2
Received results from task 3
******************************************************
Result Matrix:
0.00 1015.00 2030.00 3045.00 4060.00 5075.00 6090.00 0.00 1120.00 2240.00 3360.00 4480.00 5600.00 6720.00 0.00 1225.00 2450.00 3675.00 4900.00 6125.00 7350.00 0.00 1330.00 2660.00 3990.00 5320.00 6650.00 7980.00 0.00 1435.00 2870.00 4305.00 5740.00 7175.00 8610.00 0.00 1540.00 3080.00 4620.00 6160.00 7700.00 9240.00 0.00 1645.00 3290.00 4935.00 6580.00 8225.00 9870.00 ……… 0.00 6475.00 12950.00 19425.00 25900.00 32375.00 38850.00 0.00 6580.00 13160.00 19740.00 26320.00 32900.00 39480.00 0.00 6685.00 13370.00 20055.00 26740.00 33425.00 40110.00 0.00 6790.00 13580.00 20370.00 27160.00 33950.00 40740.00 0.00 6895.00 13790.00 20685.00 27580.00 34475.00 41370.00 0.00 7000.00 14000.00 21000.00 28000.00 35000.00 42000.00 0.00 7105.00 14210.00 21315.00 28420.00 35525.00 42630.00 0.00 7210.00 14420.00 21630.00 28840.00 36050.00 43260.00 0.00 7315.00 14630.00 21945.00 29260.00 36575.00 43890.00 0.00 7420.00 14840.00 22260.00 29680.00 37100.00 44520.00 ******************************************************Done.[cdekate@compute-0-6 matrix_multiplication]$
Topics
• Introduction
• Midterm Exam Review
• Matrix Multiplication
• N-Body Problem
• Fast Fourier Transform (FFT)
• Summary – Materials for Test
N Bodies
OU Supercomputing Center for Education & Research
N-Body Problems
An N-body problem is a problem involving N “bodies” – that is, particles (e.g., stars, atoms) – each of which applies a force to all of the others.
For example, if you have N stars, then each of the N stars exerts a force (gravity) on all of the other N–1 stars.
Likewise, if you have N atoms, then every atom exerts a force on all of the other N–1 atoms. The forces are Coulombic and van der Waals.
2-Body Problem
When N is 2, you have – surprise! – a 2-Body Problem: exactly two particles, each exerting a force that acts on the other.
The relationship between the 2 particles can be expressed as a differential equation that can be solved analytically, producing a closed-form solution.
So, given the particles’ initial positions and velocities, you can immediately calculate their positions and velocities at any later time.
N-Body Problems
For N of 3 or more, no one knows how to solve the equations to get a closed form solution.
So, numerical simulation is pretty much the only way to study groups of 3 or more bodies.
Popular applications of N-body codes include astronomy and chemistry.
Note that, for N bodies, there are on the order of N^2 forces, denoted O(N^2).
N-Body Problems
Given N bodies, each body exerts a force on all of the other N–1 bodies.
Therefore, there are N • (N–1) forces in total.
You can also think of this as (N • (N–1))/2 forces, in the sense that the force from particle A to particle B is the same (except in the opposite direction) as the force from particle B to particle A.
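The interaction count above is simple enough to compute directly. An illustrative helper (not from the slides):

```c
#include <assert.h>

/* Number of unique pair interactions among n bodies: n*(n-1)/2 */
unsigned long long nbody_pairs(unsigned long long n) {
    return n * (n - 1) / 2;
}
```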
N-Body Problems
Given N bodies, each body exerts a force on all of the other N–1 bodies.
Therefore, there are N • (N–1) forces in total.
In Big-O notation, that’s O(N^2) forces to calculate.

So, calculating the forces takes O(N^2) time to execute.

But, there are only N particles, each taking up the same amount of memory, so we say that N-body codes are of:

• O(N) spatial complexity (memory)
• O(N^2) time complexity
O(N^2) Forces
Note that this picture shows only the forces between A and everyone else.
A
How to Calculate?
Whatever your physics is, you have some function, F(A,B), that expresses the force between two bodies A and B.
For example,
F(A,B) = G · m_A · m_B / dist(A,B)^2

where G is the gravitational constant and m is the mass of the particle in question.

If you have all of the forces for every pair of particles, then you can calculate their sum, obtaining the force on every particle.
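The gravitational pair force is one line of C. A sketch (the name is mine, and a real code would also need the force's direction, not just its magnitude):

```c
#include <assert.h>

/* Magnitude of the gravitational force between bodies A and B:
   F = G * mA * mB / dist^2 */
double grav_force(double G, double mA, double mB, double dist) {
    return G * mA * mB / (dist * dist);
}
```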
How to Parallelize?
Okay, so let’s say you have a nice serial (single-CPU) code that does an N-body calculation.
How are you going to parallelize it?
You could:
• have a master feed particles to processes;
• have a master feed interactions to processes;
• have each process decide on its own subset of the particles, and then share around the forces;
• have each process decide on its own subset of the interactions, and then share around the forces.
Do You Need a Master?
Let’s say that you have N bodies, and therefore you have ½N(N-1) interactions (every particle interacts with all of the others, but you don’t need to calculate both A→B and B→A).
Do you need a master?
Well, can each processor determine on its own either (a) which of the bodies to process, or (b) which of the interactions?
If the answer is yes, then you don’t need a master.
N-Body “Pipeline” Implementation Flowchart
1. Initialize MPI environment
2. Create ring communicator
3. Initialize particle parameters
4. Copy local particle data to send buffer
5. Initiate transmission of send buffer to the RIGHT neighbor in ring
6. Initiate reception of data from the LEFT neighbor in ring
7. Compute forces between local and send buffer particles
8. Wait for message exchange to complete
9. Copy particle data from receive buffer to send buffer
10. Processed particles from all remote nodes? If N, repeat from step 5; if Y, continue
11. Update positions of local particles
12. All iterations done? If N, repeat from step 4; if Y, finalize MPI
N-Body (source code)
#include "mpi.h"
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

/* Pipeline version of the algorithm... */
/* we really need the velocities as well… */

/* Simplified structure describing parameters of a single particle */
typedef struct {
   double x, y, z;
   double mass;
} Particle;

/* We use leapfrog for the time integration ... */

/* Structure to hold force components and old position coordinates of a particle */
typedef struct {
   double xold, yold, zold;
   double fx, fy, fz;
} ParticleV;

void InitParticles( Particle[], ParticleV [], int );
double ComputeForces( Particle [], Particle [], ParticleV [], int );
double ComputeNewPos( Particle [], ParticleV [], int, double, MPI_Comm );

#define MAX_PARTICLES 4000
#define MAX_P 128
N-Body (source code)
int main( int argc, char *argv[] )
{
   Particle particles[MAX_PARTICLES];  /* Particles on ALL nodes */
   ParticleV pv[MAX_PARTICLES];        /* Particle velocity */
   Particle sendbuf[MAX_PARTICLES],    /* Pipeline buffers */
            recvbuf[MAX_PARTICLES];
   MPI_Request request[2];
   int counts[MAX_P],                  /* Number on each processor */
       displs[MAX_P];                  /* Offsets into particles */
   int rank, size, npart, i, j,
       offset;                         /* location of local particles */
   int totpart,                        /* total number of particles */
       cnt;                            /* number of times in loop */
   MPI_Datatype particletype;
   double sim_t;                       /* Simulation time */
   double time;                        /* Computation time */
   int pipe, left, right, periodic;
   MPI_Comm commring;
   MPI_Status statuses[2];
/* Initialize MPI Environment */ MPI_Init( &argc, &argv ); MPI_Comm_rank( MPI_COMM_WORLD, &rank ); MPI_Comm_size( MPI_COMM_WORLD, &size );
/* Create 1-dimensional periodic Cartesian communicator (a ring) */ periodic = 1; MPI_Cart_create( MPI_COMM_WORLD, 1, &size, &periodic, 1, &commring ); MPI_Cart_shift( commring, 0, 1, &left, &right ); /* Find the closest neighbors in ring */
/* Calculate local fraction of particles */ if (argc < 2) {
fprintf( stderr, "Usage: %s n\n", argv[0] );MPI_Abort( MPI_COMM_WORLD, 1 );
} npart = atoi(argv[1]) / size; if (npart * size > MAX_PARTICLES) {
fprintf( stderr, "%d is too many; max is %d\n", npart*size, MAX_PARTICLES );MPI_Abort( MPI_COMM_WORLD, 1 );
} MPI_Type_contiguous( 4, MPI_DOUBLE, &particletype ); /* Data type corresponding to Particle struct */ MPI_Type_commit( &particletype );
/* Get the sizes and displacements */ MPI_Allgather( &npart, 1, MPI_INT, counts, 1, MPI_INT, commring ); displs[0] = 0; for (i=1; i<size; i++)
displs[i] = displs[i-1] + counts[i-1]; totpart = displs[size-1] + counts[size-1];
/* Generate the initial values */ InitParticles( particles, pv, npart); offset = displs[rank]; cnt = 10; time = MPI_Wtime(); sim_t = 0.0;
/* Begin simulation loop */ while (cnt--) {
double max_f, max_f_seg;
N-Body (source code)

   /* Load the initial send buffer */
   memcpy( sendbuf, particles, npart * sizeof(Particle) );
   max_f = 0.0;
   for (pipe = 0; pipe < size; pipe++) {
      if (pipe != size-1) {
         /* Initiate send to the “right” neighbor, while receiving from the “left” */
         MPI_Isend( sendbuf, npart, particletype, right, pipe, commring, &request[0] );
         MPI_Irecv( recvbuf, npart, particletype, left, pipe, commring, &request[1] );
      }
      /* Compute forces */
      max_f_seg = ComputeForces( particles, sendbuf, pv, npart );
      if (max_f_seg > max_f) max_f = max_f_seg;
      /* Wait for updates to complete and copy received particles to the send buffer */
      if (pipe != size-1) MPI_Waitall( 2, request, statuses );
      memcpy( sendbuf, recvbuf, counts[pipe] * sizeof(Particle) );
   }
   /* Compute the changes in position using the already calculated forces */
   sim_t += ComputeNewPos( particles, pv, npart, max_f, commring );
   /* We could do graphics here (move particles on the display) */
   }
   time = MPI_Wtime() - time;
   if (rank == 0) {
      printf( "Computed %d particles in %f seconds\n", totpart, time );
   }
   MPI_Finalize();
   return 0;
}
N-Body (source code)

/* Initialize particle positions, masses and forces */
void InitParticles( Particle particles[], ParticleV pv[], int npart )
{
   int i;
   for (i = 0; i < npart; i++) {
      particles[i].x = drand48();
      particles[i].y = drand48();
      particles[i].z = drand48();
      particles[i].mass = 1.0;
      pv[i].xold = particles[i].x;
      pv[i].yold = particles[i].y;
      pv[i].zold = particles[i].z;
      pv[i].fx = 0;
      pv[i].fy = 0;
      pv[i].fz = 0;
   }
}

/* Compute forces (2-D only) */
double ComputeForces( Particle myparticles[], Particle others[], ParticleV pv[], int npart )
{
   double max_f, rmin;
   int i, j;
   max_f = 0.0;
   for (i = 0; i < npart; i++) {
      double xi, yi, mi, rx, ry, mj, r, fx, fy;
      rmin = 100.0;
      xi = myparticles[i].x;
      yi = myparticles[i].y;
      fx = 0.0;
      fy = 0.0;
N-Body (source code)

      for (j = 0; j < npart; j++) {
         rx = xi - others[j].x;
         ry = yi - others[j].y;
         mj = others[j].mass;
         r = rx * rx + ry * ry;
         /* ignore overlap and same particle */
         if (r == 0.0) continue;
         if (r < rmin) rmin = r;
         /* compute forces */
         r = r * sqrt(r);
         fx -= mj * rx / r;
         fy -= mj * ry / r;
      }
      pv[i].fx += fx;
      pv[i].fy += fy;
      /* Compute a rough estimate of (1/m)|df / dx| */
      fx = sqrt(fx*fx + fy*fy)/rmin;
      if (fx > max_f) max_f = fx;
   }
   return max_f;
}

/* Update particle positions (2-D only) */
double ComputeNewPos( Particle particles[], ParticleV pv[], int npart, double max_f, MPI_Comm commring )
{
   int i;
   double a0, a1, a2;
   static double dt_old = 0.001, dt = 0.001;
   double dt_est, new_dt, dt_new;
N-Body (source code)

   /* integration is a0 * x^+ + a1 * x + a2 * x^- = f / m */
   a0 = 2.0 / (dt * (dt + dt_old));
   a2 = 2.0 / (dt_old * (dt + dt_old));
   a1 = -(a0 + a2);   /* also -2/(dt*dt_old) */
   for (i = 0; i < npart; i++) {
      double xi, yi;
      /* Very, very simple leapfrog time integration. We use a
         variable-step version to simplify time-step control. */
      xi = particles[i].x;
      yi = particles[i].y;
      particles[i].x = (pv[i].fx - a1 * xi - a2 * pv[i].xold) / a0;
      particles[i].y = (pv[i].fy - a1 * yi - a2 * pv[i].yold) / a0;
      pv[i].xold = xi;
      pv[i].yold = yi;
      pv[i].fx = 0;
      pv[i].fy = 0;
   }
   /* Recompute a time step. The stability criterion is roughly
      2/sqrt(1/m |df/dx|) >= dt. We leave a little room */
   dt_est = 1.0/sqrt(max_f);
   if (dt_est < 1.0e-6) dt_est = 1.0e-6;
   MPI_Allreduce( &dt_est, &dt_new, 1, MPI_DOUBLE, MPI_MIN, commring );
   /* Modify time step */
   if (dt_new < dt) {
      dt_old = dt;
      dt = dt_new;
   }
   else if (dt_new > 4.0 * dt) {
      dt_old = dt;
      dt *= 2.0;
   }
   return dt_old;
}
Demo : N-Body Problem
> mpirun -np 4 nbodypipe 4000
Computed 4000 particles in 1.119051 seconds
70
Topics
• Introduction
• Midterm Exam Review
• Matrix Multiplication
• N-Body Problem
• Fast Fourier Transform (FFT)
• Summary – Materials for Test
Serial FFT
• Let i = sqrt(-1) and index matrices and vectors from 0.
• The Discrete Fourier Transform of an m-element vector v is F*v, where F is the m×m matrix defined as: F[j,k] = ω^(j*k)
• where ω is: ω = e^(2πi/m) = cos(2π/m) + i*sin(2π/m)
• This is a complex number whose mth power is 1 and is therefore called an mth root of unity
• E.g., for m = 4: ω = 0+1*i, ω^2 = -1+0*i, ω^3 = 0-1*i, ω^4 = 1+0*i
Source: www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_24_1999-new.ppt
Related Transforms
• Most applications require multiplication by both F and inverse(F).
• Multiplying by F and inverse(F) are essentially the same. (inverse(F) is the complex conjugate of F divided by n.)
• For solving the Poisson equation and various other applications, we use variations on the FFT
– The sin transform -- imaginary part of F
– The cos transform -- real part of F
• Algorithms are similar, so we will focus on the forward FFT.
Serial Algorithm for the FFT

• Compute the FFT of an m-element vector v, F*v:

  (F*v)[j] = Σ_{k=0}^{m-1} F[j,k] * v(k)
           = Σ_{k=0}^{m-1} ω^(j*k) * v(k)
           = Σ_{k=0}^{m-1} (ω^j)^k * v(k)
           = V(ω^j)

• where V is defined as the polynomial V(x) = Σ_{k=0}^{m-1} x^k * v(k)
Divide and Conquer FFT
• V can be evaluated using divide-and-conquer
  V(x) = Σ_{k=0}^{m-1} x^k * v(k)
       = v[0] + x^2*v[2] + x^4*v[4] + …
         + x*(v[1] + x^2*v[3] + x^4*v[5] + … )
       = Veven(x^2) + x*Vodd(x^2)

• V has degree m-1, so Veven and Vodd are polynomials of degree m/2-1
• We evaluate these at the points (ω^j)^2 for 0 <= j <= m-1
• But these are really just m/2 different points, since

  (ω^(j+m/2))^2 = (ω^j * ω^(m/2))^2 = ω^(2j) * ω^m = (ω^j)^2
Divide-and-Conquer FFT
FFT(v, ω, m)
  if m = 1 return v[0]
  else
    veven = FFT(v[0:2:m-2], ω^2, m/2)
    vodd  = FFT(v[1:2:m-1], ω^2, m/2)
    ω-vec = [ω^0, ω^1, … ω^(m/2-1)]      (precomputed)
    return [veven + (ω-vec .* vodd),
            veven - (ω-vec .* vodd)]

• The .* above is component-wise multiplication.
• The […,…] constructs an m-element vector from two m/2-element vectors.

This results in an O(m log m) algorithm.
1D FFT: Butterfly Pattern
Higher Dimension FFTs
• FFTs on 2 or 3 dimensions are defined as 1D FFTs on vectors in all dimensions.
• E.g., a 2D FFT does 1D FFTs on all rows and then all columns
• There are 3 obvious possibilities for the 2D FFT:
– (1) 2D blocked layout for matrix, using 1D algorithms for each row and column
– (2) Block row layout for matrix, using serial 1D FFTs on rows, followed by a transpose, then more serial 1D FFTs
– (3) Block row layout for matrix, using serial 1D FFTs on rows, followed by parallel 1D FFTs on columns
– Option 1 is best
• For a 3D FFT the options are similar
– 2 phases done with serial FFTs, followed by a transpose for the 3rd dimension
– can overlap communication with the 2nd phase in practice
78
2-D FFT Flowchart

MASTER:
1. Initialize MPI environment
2. Initialize matrix
3. Distribute matrix by rows (MPI_Scatter)
4. Reorganize matrix slice into square chunks
5. Compute 1-D FFTs on rows
6. Redistribute matrix chunks (MPI_Alltoall)
7. Transpose matrix chunks
8. Compute 1-D FFTs on rows
9. Collect matrix slices (MPI_Gather)
10. Transpose assembled matrix
11. Finalize MPI

WORKER:
1. Initialize MPI environment
2. Receive matrix slice (MPI_Scatter)
3. Reorganize matrix slice into square chunks
4. Compute 1-D FFTs on rows
5. Redistribute matrix chunks (MPI_Alltoall)
6. Transpose matrix chunks
7. Compute 1-D FFTs on rows
8. Send matrix slice (MPI_Gather)
9. Finalize MPI
Fast Fourier Transform
79
MPI - Two-Dimensional Fast Fourier Transform - C Version
• The image originates on a single processor (SOURCE_PROCESSOR).
• This image, a[], is distributed by rows to all other processors.
• Each processor then performs a one-dimensional FFT on the rows of the image stored locally.
• The image is then transposed using the MPI_Alltoall() routine; this partitions the intermediate image by columns.
• Each processor then performs a one-dimensional FFT on the columns of the image.
• Finally, the columns of the image are collected back at the destination processor and the output image is tested for correctness.
• Input is a 512x512 complex matrix, initialized with a point source.
• Output is a 512x512 complex matrix that overwrites the input matrix.
• Timing and Mflop results are displayed following execution.
• A straightforward, unsophisticated 1-D FFT kernel is used. It is sufficient to convey the general idea, but be aware that better 1-D FFTs are available on many systems.
2D FFT – Code Walkthrough … 1
80
#include <stdio.h>
#include <stdlib.h>
#include <sys/utsname.h>
#include <math.h>
#include <sys/time.h>
#include <time.h>
#include <mpi.h>
#include "mpi_2dfft.h"
#define IMAGE_SIZE 512
#define NUM_CELLS 4
#define IMAGE_SLICE (IMAGE_SIZE / NUM_CELLS)
#define SOURCE_PROCESSOR 0
#define DEST_PROCESSOR SOURCE_PROCESSOR
int numtasks;                          /* Number of processors */
int taskid;                            /* ID number for each processor */
mycomplex a[IMAGE_SIZE][IMAGE_SIZE];   /* input matrix: complex numbers */
mycomplex a_slice[IMAGE_SLICE][IMAGE_SIZE];
mycomplex a_chunks[NUM_CELLS][IMAGE_SLICE][IMAGE_SLICE];
mycomplex b[IMAGE_SIZE][IMAGE_SIZE];   /* intermediate matrix */
mycomplex b_slice[IMAGE_SIZE][IMAGE_SLICE];
mycomplex b_chunks[NUM_CELLS][IMAGE_SLICE][IMAGE_SLICE];
mycomplex *collect;
mycomplex w_common[IMAGE_SIZE/2];      /* twiddle factors */
struct timeval etime[10];
int checkpoint;
float dt[10], sum;
2D FFT – Code Walkthrough … 2
81
int main(int argc, char *argv[])
{
  int rc, cell, i, j, n, nx, logn, errors, sign, flops;
  float mflops;

  checkpoint = 0;
  /* Initialize MPI environment and get this task's ID and the number
     of tasks in the partition */
  rc  = MPI_Init(&argc, &argv);
  rc |= MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  rc |= MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

  /* This program requires exactly NUM_CELLS (here 4) tasks */
  if (numtasks != NUM_CELLS) {
    printf("Error: this program requires %d MPI tasks\n", NUM_CELLS);
    exit(1);
  }
  if (rc != MPI_SUCCESS)
    printf("error initializing MPI and obtaining task ID information\n");
  else
    printf("MPI task ID = %d\n", taskid);
  n = IMAGE_SIZE;

  /* Compute logn and ensure that n is a power of two */
  nx = n;
  logn = 0;
  while ((nx >>= 1) > 0)
    logn++;
  nx = 1;
  for (i = 0; i < logn; i++)
    nx = nx * 2;
  if (nx != n) {
    (void)fprintf(stderr, "%d: fft size must be a power of 2\n", IMAGE_SIZE);
    exit(0);
  }
  /* Initialize the input array: a point source at the center
     (a[256][256] = 512.0 in both parts), zero elsewhere */
  if (taskid == SOURCE_PROCESSOR) {
    for (i = 0; i < n; i++)
      for (j = 0; j < n; j++)
        a[i][j].r = a[i][j].i = 0.0;
    a[n/2][n/2].r = a[n/2][n/2].i = (float)n;

    /* print table headings in anticipation of timing results */
    printf("512 x 512 2D FFT\n");
    printf("                    Timings(secs)\n");
    printf("        scatter 1D-FFT-row transpose 1D-FFT-col gather");
    printf("   total\n");
  }

  /* Precompute the complex constants (twiddle factors) for the 1D FFTs */
  for (i = 0; i < n/2; i++) {
    w_common[i].r = (float)  cos((double)((2.0*PI*i)/(float)n));
    w_common[i].i = (float) -sin((double)((2.0*PI*i)/(float)n));
  }
2D FFT – Code Walkthrough … 3
82
  /* Distribute input matrix by rows */
  rc = MPI_Barrier(MPI_COMM_WORLD);
  if (rc != MPI_SUCCESS) {
    printf("Error: MPI_Barrier() failed with return code %d\n", rc);
    return(-1);
  }
  gettimeofday(&etime[checkpoint++], (struct timeval *)0);

  /* IMAGE_SLICE is the number of rows per process.  Each slice of the
     image is delivered to the corresponding process by MPI_Scatter();
     counts are in MPI_FLOATs, hence the factor of 2 for real+imaginary */
  rc = MPI_Scatter((char *)a, IMAGE_SLICE * IMAGE_SIZE * 2, MPI_FLOAT,
                   (char *)a_slice, IMAGE_SLICE * IMAGE_SIZE * 2, MPI_FLOAT,
                   SOURCE_PROCESSOR, MPI_COMM_WORLD);
  if (rc != MPI_SUCCESS) {
    printf("Error: MPI_Scatter() failed with return code %d\n", rc);
    return(-1);
  }
  gettimeofday(&etime[checkpoint++], (struct timeval *)0);

  /* Perform 1-D row FFTs: a_slice[][] holds this task's rows of the
     image; compute a 1-D FFT over each row of the slice */
  for (i = 0; i < IMAGE_SLICE; i++)
    fft(&a_slice[i][0], w_common, n, logn);
  gettimeofday(&etime[checkpoint++], (struct timeval *)0);
2D FFT – Code Walkthrough … 4
83
2D FFT – Code Walkthrough … 5
84
  /* Transpose 2-D image: break this task's slice into square
     IMAGE_SLICE x IMAGE_SLICE chunks, one per destination task */
  for (cell = 0; cell < NUM_CELLS; cell++) {
    for (i = 0; i < IMAGE_SLICE; i++) {
      for (j = 0; j < IMAGE_SLICE; j++) {
        a_chunks[cell][i][j].r = a_slice[i][j + (IMAGE_SLICE * cell)].r;
        a_chunks[cell][i][j].i = a_slice[i][j + (IMAGE_SLICE * cell)].i;
      }
    }
  }

  /* IMAGE_SLICE * IMAGE_SLICE * 2 floats per chunk (real and imaginary);
     each chunk is delivered to the corresponding process by MPI_Alltoall() */
  rc = MPI_Alltoall(a_chunks, IMAGE_SLICE * IMAGE_SLICE * 2, MPI_FLOAT,
                    b_slice, IMAGE_SLICE * IMAGE_SLICE * 2, MPI_FLOAT,
                    MPI_COMM_WORLD);
  if (rc != MPI_SUCCESS) {
    printf("Error: MPI_Alltoall() failed in cell %d return code %d\n",
           taskid, rc);
    return(-1);
  }
  gettimeofday(&etime[checkpoint++], (struct timeval *)0);
2D FFT – Code Walkthrough … 6
85
  /* Complete the transpose within the received chunks */
  for (i = 0; i < IMAGE_SLICE; i++) {
    for (j = 0; j < IMAGE_SIZE; j++) {
      a_slice[i][j].r = b_slice[j][i].r;
      a_slice[i][j].i = b_slice[j][i].i;
    }
  }

  /* Perform 1-D FFTs (effectively on columns of the original image) */
  for (i = 0; i < IMAGE_SLICE; i++)
    fft(&a_slice[i][0], w_common, IMAGE_SIZE, logn);
  gettimeofday(&etime[checkpoint++], (struct timeval *)0);

  /* Undistribute the output matrix by rows */
  collect = (mycomplex *)malloc(IMAGE_SIZE * IMAGE_SIZE * sizeof(mycomplex));

  /* Every process executes MPI_Gather() */
  rc = MPI_Gather(a_slice, IMAGE_SLICE * IMAGE_SIZE * 2, MPI_FLOAT,
                  a, IMAGE_SLICE * IMAGE_SIZE * 2, MPI_FLOAT,
                  DEST_PROCESSOR, MPI_COMM_WORLD);
  if (rc != MPI_SUCCESS) {
    printf("Error: MPI_Gather() failed with return code %d\n", rc);
    fflush(stdout);
  }
2D FFT – Code Walkthrough … 7
86
  /* On the destination processor, perform another transpose of a[][]
     into b[][] */
  if (taskid == DEST_PROCESSOR) {
    for (i = 0; i < IMAGE_SIZE; i++) {
      for (j = 0; j < IMAGE_SIZE; j++) {
        b[i][j].r = a[j][i].r;
        b[i][j].i = a[j][i].i;
      }
    }
  }
  gettimeofday(&etime[checkpoint++], (struct timeval *)0);
  fflush(stdout);

  /* Calculate event timings and print them */
  for (i = 1; i < checkpoint; i++)
    dt[i] = ((float)((etime[i].tv_sec - etime[i-1].tv_sec) * 1000000 +
                     etime[i].tv_usec - etime[i-1].tv_usec)) / 1000000.0;
  printf("cell %d: ", taskid);
  for (i = 1; i < checkpoint; i++)
    printf("%2.6f ", dt[i]);
  sum = 0;
  for (i = 1; i < checkpoint; i++)
    sum += dt[i];
  printf(" %2.6f \n", sum);
2D FFT – Code Walkthrough … 8
87
  /* Report Mflops and verify the result: a centered point source
     transforms to a checkerboard of +/- n */
  if (taskid == DEST_PROCESSOR) {
    flops = (n*n*logn)*10;
    mflops = ((float)flops/1000000.0);
    mflops = mflops/(float)sum;
    printf("Total Mflops= %3.4f\n", mflops);

    errors = 0;
    for (i = 0; i < n; i++) {
      if (((i+1)/2)*2 == i) sign = 1;
      else sign = -1;
      for (j = 0; j < n; j++) {
        if (b[i][j].r > n*sign+EPSILON || b[i][j].r < n*sign-EPSILON ||
            b[i][j].i > n*sign+EPSILON || b[i][j].i < n*sign-EPSILON) {
          printf("[%d][%d] is %f,%f should be %f\n", i, j,
                 b[i][j].r, b[i][j].i, (float)n*sign);
          errors++;
        }
        sign *= -1;
      }
    }
    if (errors) {
      printf("%d errors!!!!!\n", errors);
      exit(0);
    }
  }
  MPI_Finalize();
  exit(0);
}
2D FFT – Code Walkthrough … 9
88
void fft(mycomplex *data, mycomplex *w_common, int n, int logn)
{
  int incrvec, i0, i1, i2;
  float f0, f1;
  void bit_reverse();

  /* bit-reverse the input vector */
  (void)bit_reverse(data, n);

  /* do the first logn-1 stages of the fft */
  i2 = logn;
  for (incrvec = 2; incrvec < n; incrvec <<= 1) {
    i2--;
    for (i0 = 0; i0 < incrvec >> 1; i0++) {
      for (i1 = 0; i1 < n; i1 += incrvec) {
        f0 = data[i0+i1 + incrvec/2].r * w_common[i0<<i2].r -
             data[i0+i1 + incrvec/2].i * w_common[i0<<i2].i;
        f1 = data[i0+i1 + incrvec/2].r * w_common[i0<<i2].i +
             data[i0+i1 + incrvec/2].i * w_common[i0<<i2].r;
        data[i0+i1 + incrvec/2].r = data[i0+i1].r - f0;
        data[i0+i1 + incrvec/2].i = data[i0+i1].i - f1;
        data[i0+i1].r = data[i0+i1].r + f0;
        data[i0+i1].i = data[i0+i1].i + f1;
      }
    }
  }
2D FFT – Code Walkthrough … 10
89
  /* do the last stage of the fft */
  for (i0 = 0; i0 < n/2; i0++) {
    f0 = data[i0 + n/2].r * w_common[i0].r - data[i0 + n/2].i * w_common[i0].i;
    f1 = data[i0 + n/2].r * w_common[i0].i + data[i0 + n/2].i * w_common[i0].r;
    data[i0 + n/2].r = data[i0].r - f0;
    data[i0 + n/2].i = data[i0].i - f1;
    data[i0].r = data[i0].r + f0;
    data[i0].i = data[i0].i + f1;
  }
}

/* bit_reverse - simple (but somewhat inefficient) bit reverse */
void bit_reverse(mycomplex *a, int n)
{
  int i, j, k;

  j = 0;
  for (i = 0; i < n-2; i++) {
    if (i < j) {
      SWAP(a[j], a[i]);
    }
    k = n >> 1;
    while (k <= j) {
      j -= k;
      k >>= 1;
    }
    j += k;
  }
}
FFT Header File
90
/***************************************************************************
 * FILE: mpi_2dfft.h
 * DESCRIPTION: see mpi_2dfft.c
 * AUTHOR: George Gusciora
 * LAST REVISED:
 ***************************************************************************/
#define MAXN 2048            /* max 2d fft size */
#define EPSILON 0.00001      /* for comparing fp numbers */
#define PI 3.14159265358979  /* 4*atan(1.0) */

typedef struct { float r, i; } mycomplex;

/* swap a pair of complex numbers */
#define SWAP(a,b) {float swap_temp=(a).r;(a).r=(b).r;(b).r=swap_temp;\
                   swap_temp=(a).i;(a).i=(b).i;(b).i=swap_temp;}

/* swap a pair of floats */
#define MYSWAP(a,b) {float swap_temp=a;a=b;b=swap_temp;}
91
Topics
• Introduction
• Midterm Exam Review
• Matrix Multiplication
• N-Body Problem
• Fast Fourier Transform (FFT)
• Summary – Materials for Test
92
Summary – Material for the Test
• Introduction – Slides: 4, 5, 6
• Matrix Multiply basic algorithm – Slides: 49 – 54
• N-body –
• FFT –
93