Jun 27, 2020

Page 1

Introduction to parallel computers and parallel programming

Introduction to parallel computers and parallel programming – p. 1

Page 2

Content

A quick overview of modern parallel hardware

Parallelism within a chip: pipelining, superscalar execution, SIMD, multiple cores

Parallelism within a compute node: multiple sockets, UMA vs. NUMA

Parallelism across multiple nodes

A very quick overview of parallel programming


Page 3

First things first

CPU—central processing unit—is the “brain” of a computer

CPU processes instructions, many of which require data transfers from/to the computer's memory

CPU integrates many components (registers, FPUs, caches...)

CPU has a “clock”, which at each clock cycle synchronizes the logic units within the CPU to process instructions


Page 4

An example of a CPU core

Block diagram of an Intel Xeon Woodcrest CPU core


Page 5

Instruction pipelining

Suppose every instruction has five stages, each taking one cycle

Without instruction pipelining

With instruction pipelining
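The gain from pipelining can be sketched with a toy cycle-count model (an idealized calculation under the five-stage assumption above, ignoring stalls and hazards):

```python
def cycles_without_pipelining(n_instructions, n_stages=5):
    # Each instruction passes through all stages before the next one starts.
    return n_instructions * n_stages

def cycles_with_pipelining(n_instructions, n_stages=5):
    # The first instruction needs n_stages cycles to fill the pipeline;
    # afterwards one instruction completes per cycle.
    return n_stages + (n_instructions - 1)

print(cycles_without_pipelining(100))  # 500
print(cycles_with_pipelining(100))     # 104
```

For long instruction streams the speedup approaches the number of stages, here 5.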


Page 6

Superscalar execution

Multiple execution units ⇒ more than one instruction can finish per cycle

An enhanced form of instruction-level parallelism
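The toy pipeline model above can be extended with an issue width: with several execution units, up to `width` independent instructions may complete per cycle (again an idealized sketch that ignores data dependencies):

```python
import math

def cycles_superscalar(n_instructions, n_stages=5, width=2):
    # Idealized: after the pipeline fills, up to `width` instructions
    # finish per cycle instead of one.
    return n_stages + math.ceil((n_instructions - 1) / width)

print(cycles_superscalar(100, width=1))  # 104 (plain pipelining)
print(cycles_superscalar(100, width=2))  # 55
```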


Page 7

Data

Data are stored in computer memory as a sequence of 0s and 1s

Each 0 or 1 occupies one bit

8 bits constitute one byte

Normally, in the C language:
char: 1 byte
int: 4 bytes
float: 4 bytes
double: 8 bytes

Bandwidth—the speed of data transfer—is measured as the number of bytes transferred per second
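As a small illustration of these numbers (the 4 GB/s figure is just an assumed bandwidth, not a measurement):

```python
# Typical C type sizes in bytes, as listed above:
sizes = {"char": 1, "int": 4, "float": 4, "double": 8}

# Time to transfer an array of one million doubles at 4 GB/s:
nbytes = 1_000_000 * sizes["double"]   # 8,000,000 bytes
bandwidth = 4e9                        # bytes per second
print(nbytes / bandwidth)              # 0.002 (seconds)
```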


Page 8

SIMD

SISD: single instruction stream, single data stream

SIMD: single instruction stream, multiple data streams
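The difference can be illustrated by counting instructions (a sketch; the 4-lane width is an assumption, e.g. four floats in a 128-bit vector register):

```python
import math

def sisd_instructions(n_elements):
    # SISD: one instruction operates on one data element.
    return n_elements

def simd_instructions(n_elements, lanes=4):
    # SIMD: one vector instruction operates on `lanes` elements at once.
    return math.ceil(n_elements / lanes)

print(sisd_instructions(1000))   # 1000
print(simd_instructions(1000))   # 250
```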


Page 9

An example of a floating-point unit

FP unit on an Intel Xeon CPU


Page 10

Multicore processor

Modern hardware technology can put several independent CPU cores on the same chip—a multicore processor

Intel Xeon Nehalem quad-core processor


Page 11

Multi-threading

Modern CPU cores often have threading capability

Hardware support for multiple threads to be executed within a core

However, threads have to share the resources of a core:
computing units
caches
translation lookaside buffer


Page 12

Vector processor

Another approach, different from multicore (and multi-threading)

Massive SIMD
Vector registers
Direct pipes into main memory with high bandwidth

Used to be the dominant high-performance computing hardware, but is now only a niche technology


Page 13

Multi-socket

Socket—a connector on the motherboard that a processor is plugged into

Modern computers often have several sockets
Each socket holds a multicore processor
Example: Nehalem-EP (2×socket, quad-core CPUs, 8 cores in total)


Page 14

Shared memory

Shared memory: all CPU cores can access all memory as a global address space

Traditionally called “multiprocessor”


Page 15

UMA

UMA—uniform memory access, one type of shared memory

Another name for symmetric multi-processing

Dual-socket Xeon Clovertown CPUs


Page 16

NUMA

NUMA—non-uniform memory access, another type of shared memory

Several symmetric multi-processing units are linked together

Each core should access its closest memory unit, as much as possible

Dual-socket Xeon Nehalem CPUs

Page 17

Cache coherence

Important for shared-memory systems

If one CPU core updates a value in its private cache, all the othercores “know” about the update

Cache coherence is accomplished by hardware

Section 2.4.6 of Introduction to Parallel Computing describes several strategies for achieving cache coherence


Page 18

“Competition” among the cores

Within a multi-socket multicore computer, some resources are shared

Within a socket, the cores share the last-level cache

The memory bandwidth is also shared to a great extent

# cores     1           2           4           6           8
BW          3.42 GB/s   4.56 GB/s   4.57 GB/s   4.32 GB/s   5.28 GB/s

Actual memory bandwidth measured on a 2×socket quad-core Xeon Harpertown
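The table makes the sharing visible: plugging its numbers into a quick calculation (measured values from the slide, not new data), eight cores deliver only about 1.5 times the bandwidth of one core:

```python
# Measured bandwidth (GB/s) vs. number of cores, from the table above:
bw = {1: 3.42, 2: 4.56, 4: 4.57, 6: 4.32, 8: 5.28}

scaling = bw[8] / bw[1]
print(round(scaling, 2))  # 1.54 -- far from the ideal 8x
```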


Page 19

Distributed memory

The entire memory consists of several disjoint parts

A communication network is needed in between

There is no single global memory space

A CPU (core) can directly access its own local memory

A CPU (core) cannot directly access a remote memory

A distributed-memory system is traditionally called a “multicomputer”


Page 20

Comparing shared memory and distributed memory

Shared memory:
User-friendly programming
Data sharing between processors
Not cost effective
Synchronization needed

Distributed memory:
Memory is scalable with the number of processors
Cost effective
Programmer responsible for data communication


Page 21

Hybrid memory system

[Figure: three compute nodes, each containing two multicore sockets whose cores share caches and are connected by a bus to the node's memory; the nodes are linked by an interconnect network]


Page 22

Different ways of parallel programming

Threads model using OpenMP:
Easy to program (inserting a few OpenMP directives)
Parallelism "behind the scenes" (little user control)
Difficult to scale to many CPUs (NUMA, cache coherence)

Message passing model using MPI:
Many programming details
Better user control (data & work decomposition)
Larger systems and better performance

Stream-based programming (for using GPUs)

Some special parallel languages: Co-Array Fortran, Unified Parallel C, Titanium

Hybrid parallel programming


Page 23

Designing parallel programs

Determine whether or not the problem is parallelizable

Identify “hotspots”:
Where do most of the computations occur?
Parallelization should focus on the hotspots

Partition the problem

Insert collaboration (unless embarrassingly parallel)


Page 24

Partitioning

Break the problem into “chunks”

Domain decomposition (data decomposition)

Functional decomposition

https://computing.llnl.gov/tutorials/parallel_comp/
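A minimal sketch of 1-D domain decomposition, splitting an index range into contiguous, near-equal chunks (the function name `partition` is just illustrative):

```python
def partition(n, n_parts):
    """Split indices 0..n-1 into n_parts contiguous, near-equal chunks."""
    base, rem = divmod(n, n_parts)
    chunks, start = [], 0
    for p in range(n_parts):
        size = base + (1 if p < rem else 0)  # first `rem` chunks get one extra
        chunks.append(range(start, start + size))
        start += size
    return chunks

print([len(c) for c in partition(10, 4)])  # [3, 3, 2, 2]
```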

Page 25

Examples of domain decomposition

https://computing.llnl.gov/tutorials/parallel_comp/


Page 26

Collaboration

Communication:
Overhead depends on both the number and size of messages
Overlap communication with computation, if possible
Different types of communication (one-to-one, collective)

Synchronization:
Barrier
Lock & semaphore
Synchronous communication operations
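The dependence on the number and size of messages is often captured by the latency-bandwidth cost model; a sketch with assumed values (1 microsecond latency, 1 GB/s bandwidth):

```python
def message_time(nbytes, latency=1e-6, bandwidth=1e9):
    # Latency-bandwidth model: time = latency + nbytes / bandwidth
    return latency + nbytes / bandwidth

# Sending 1 MB as one message vs. as 1000 messages of 1 KB each:
one_large = message_time(1_000_000)
many_small = 1000 * message_time(1000)
print(one_large < many_small)  # True: fewer, larger messages pay less latency
```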


Page 27

Load balancing

Objective: minimize idle time

Important for parallel performance

Balanced partitioning of work (and/or data)

Dynamic work assignment may be necessary
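Static vs. dynamic assignment can be compared with a small simulation (the task costs are hypothetical; `dynamic_makespan` mimics workers pulling tasks from a shared queue):

```python
def static_makespan(tasks, n_workers):
    # Tasks assigned round-robin, fixed before execution starts.
    loads = [0.0] * n_workers
    for i, cost in enumerate(tasks):
        loads[i % n_workers] += cost
    return max(loads)

def dynamic_makespan(tasks, n_workers):
    # Each task goes to the currently least-loaded worker, as if workers
    # pulled tasks from a shared queue whenever they become idle.
    loads = [0.0] * n_workers
    for cost in tasks:
        loads[loads.index(min(loads))] += cost
    return max(loads)

tasks = [8, 1, 1, 1, 8, 1, 1, 1]  # uneven task costs
print(static_makespan(tasks, 2))   # 18.0
print(dynamic_makespan(tasks, 2))  # 11.0
```

With uneven task costs, the dynamic scheme finishes earlier because no worker sits idle while another still holds a long queue of work.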


Page 28

Granularity

Computations are typically separated from communications by synchronization events

Granularity: ratio of computation to communication

Fine-grain parallelism:
Individual tasks are relatively small
More overhead incurred
Might be easier for load balancing

Coarse-grain parallelism:
Individual tasks are relatively large
Advantageous for performance due to lower overhead
Might be harder for load balancing

https://computing.llnl.gov/tutorials/parallel_comp/
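The trade-off can be quantified as the fraction of time lost to communication (hypothetical numbers, assuming a fixed communication cost per task):

```python
def overhead_fraction(compute_time, comm_time):
    # Fraction of total time spent communicating rather than computing.
    return comm_time / (compute_time + comm_time)

comm = 0.1                               # fixed communication cost per task
fine = overhead_fraction(1.0, comm)      # many small tasks
coarse = overhead_fraction(10.0, comm)   # few large tasks
print(round(fine, 3), round(coarse, 3))  # 0.091 0.01
```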
