CS8625 High Performance and Parallel Computing
Dr. Ken Hoganson
Intro Parallel Architectures

Copyright © 2001, 2004, 2005, 2006, 2008, Dr. Ken Hoganson

Transcript
Page 1: Title

CS8625 High Performance and Parallel Computing
Dr. Ken Hoganson

Intro Parallel Architectures

Page 2: Server Hardware

• Mission-critical
  – High reliability
  – Redundancy

• Massive storage (disk)
  – RAID for redundancy

• High performance through replication of components
  – Multiple processors
  – Multiple buses
  – Multiple hard drives
  – Multiple network interfaces

Page 3: Computing Paradigms

[Figure: three diagrams of computing paradigms]
• “Old” computing paradigm: mainframe/terminal, with centralized processing and storage
• Failed 1st client/server computing paradigm: decentralized processing and storage across PCs and a server
• Successful 2nd client/server computing paradigm: strong centralized processing and storage, with PC clients connected to a server

Page 4: Evolving Computing Paradigm

[Figure: timeline from 1950 to 2000 plotting processing and storage locality, swinging between centralized and decentralized; labeled points include "Clusters, Servers" and "Distributed, Grid", with a "?" marking the future trend]

Page 5: Mainframe: the Ultimate Server?

• Client/server architecture was originally predicted to bring about the demise of the mainframe.

• Critical corporate data must reside on a highly reliable high performance machine

• Early PC networks did not have the needed performance or reliability
  – NOW (Network Of Workstations)
  – LAN (Local Area Network)

• Some firms, after experience with client/server problems, returned to the mainframe for critical corporate data and functions

• Modern computing paradigm combines:
  – Powerful servers (including mainframes when needed) where critical corporate data and information reside
  – Decentralized processing and non-critical storage on PCs
  – Interconnected with a network

Page 6: Multiprocessor Servers

• Multiprocessor servers offer high performance at much lower cost than a traditional mainframe

• Uses inexpensive, “off-the-shelf” components
• Combines multiple PCs or workstations in one box
• Processors cooperate to complete the work
• Processors share resources and memory
• One of the implementations of parallel processing
• Blade cluster in process of development:
  – 10 blades
  – Each blade has 2 CPUs, memory, and disk

Page 7: 5 Parallel Levels

Five levels of parallelism have been identified. Each level has both a software-level parallelism and a hardware implementation that accommodates or implements the software parallelism.

Sources:
• The Unified Parallel Speedup Model and Simulator, K. Hoganson, SE-ACM 2001, March 2001.
• Alternative Mechanisms to Achieve Parallel Speedup, K. Hoganson, First IEEE Online Symposium for Electronics Engineers, IEEE Society, November 2000.
• Workload Execution Strategies and Parallel Speedup on Clustered Computers, K. Hoganson, IEEE Transactions on Computers, Vol. 48, No. 11, November 1999.

Level | Software                 | Hardware Implementation
1     | Intra-Instruction        | Pipeline
2     | Inter-Instruction        | Super-Scalar, multiple pipelines
3     | Algorithm/Thread/Object  | MultiProcessor
4     | Multi-Process            | Clustered-Multiprocessor
5     | Distributed/N-Tier CS    | Multicomputer/Internet/Web

Page 8: Terminology

• Thread - a lightweight process, easy (efficient) to multi-task between (see the sketch after this list).

• Multiprocessor - a computer system with multiple processors combined in a single system (in a single box or frame). Processors usually share memory and other resources.

• Multicomputer - multiple discrete computers, each with its own memory and other resources, interconnected with a network.

• Clustered computer - a multiprocessor OR multicomputer that builds two levels of interconnection between processors:
  – Intra-Cluster connection (within a cluster)
  – Inter-Cluster connection (between clusters)

• Distributed Computer - a loosely coupled multicomputer; an n-Tiered Client/Server computing system is an example of distributed computing.
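
As a concrete illustration of threads as lightweight, memory-sharing units of execution, here is a minimal Python sketch (the worker function and its arguments are made up for illustration):

```python
import threading

# Each thread runs worker() concurrently with the others. Threads
# share the process's memory, which is what makes switching between
# them cheap compared to switching between full processes.
def worker(name: str, items: list) -> None:
    for item in items:
        print(f"thread {name} processing {item}")

threads = [threading.Thread(target=worker, args=(str(i), [1, 2, 3]))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every thread to finish
```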

Page 9: Clustered Multiprocessor

[Figure: clustered multiprocessor; three clusters of four CPUs, each cluster sharing a cache, with the clusters interconnected to shared memory (MEM) and I/O]

Page 10: Multi-Computer

[Figure: multicomputer; three nodes, each with its own CPU, memory (MEM), I/O, and network interface card (NIC), interconnected by a network]

Page 11: Level 5 N-Tier Client-Server

[Figure 2: N-Tier Architectures. Client Tier (1): client workstations (C) on a LAN. Server Tier (2): gateways (G) across the Internet to web host servers (W). Server Tier (3): data servers (S). Each server tier is characterized by a parallelism and average latency (PA2 & AveLat2, PA3 & AveLat3).]

C - Client Workstation, S - Data Server, G - Gateway, W - Web Host Server

Page 12: Flynn’s Classification

• Old idea, still useful.

• Examines parallelism from the point of view of the parallel scope of an instruction.

• SISD - Single Instruction, Single Data: each instruction operates on a single data item.

• SIMD - Single Instruction, Multiple Data: each instruction operates on multiple data items simultaneously (classic supercomputing); see the sketch below.

• MIMD - Multiple Instruction, Multiple Data: separate instruction/data streams. Super-scalar, multiprocessors, multicomputers.

• MISD - Multiple Instruction, Single Data: no known examples.
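
A rough software analogy (an illustration, not from the slides): a scalar Python loop processes one data item per operation, SISD-style, while a NumPy vectorized expression applies one operation across many data items at once, SIMD-style.

```python
import numpy as np

a = np.arange(100_000, dtype=np.float64)
b = np.arange(100_000, dtype=np.float64)

# SISD-style: one add at a time, one pair of data items per step.
c_scalar = [x + y for x, y in zip(a, b)]

# SIMD-style: a single vectorized add applied to all elements;
# NumPy dispatches to hardware vector instructions where available.
c_vector = a + b
```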

Page 13: Symmetric Multiprocessing

• Asymmetric Multiprocessing:
  – Multiple unique processors, each dedicated to a special function
  – A PC is an example

• Symmetric Multiprocessing:
  – Multiple identical processors able to work together on parallel problems

• Homogeneous system: a symmetric multiprocessor

• Heterogeneous system: different “makes” or models of processors combined in a system. Example: a distributed system with different types of PCs with different processors.

Page 14: Classic Model: Parallel Processing

• Multiple Processors available (4)

• A Process can be divided into serial and parallel portions

• The parallel parts are executed concurrently

• Serial Time: 10 time units

• Parallel Time: 4 time units

S - Serial or non-parallel portion
A - All A parts can be executed concurrently
B - All B parts can be executed concurrently
All A parts must be completed prior to executing the B parts.

An example parallel process of time 10:

Speedup = SerialTime / ParallelTime = 10 / 4 = 2.5

Efficiency = Speedup / processors = 2.5 / 4 = 62.5%

Executed on a single processor (10 time units):
S A A A A B B B B S

Executed in parallel on 4 processors (4 time units):
time 1: S
time 2: A A A A
time 3: B B B B
time 4: S
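
A minimal sketch of the calculation above (illustrative; the variable names are mine):

```python
# The slide's example process: one serial unit, four A parts, four
# B parts, and a final serial unit, run on a 4-processor machine.
serial_time = 1 + 4 + 4 + 1        # 10 time units on one processor
parallel_time = 1 + 1 + 1 + 1      # S, all A's together, all B's together, S
processors = 4

speedup = serial_time / parallel_time   # 10 / 4 = 2.5
efficiency = speedup / processors       # 2.5 / 4 = 62.5%
print(f"speedup = {speedup}, efficiency = {efficiency:.1%}")
```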

Page 15: Amdahl’s Law (Analytical Model)

• Analytical model of parallel speedup from the 1960s

• The parallel fraction (φ) is run over n processors, taking φ/n time

• The part that must be executed in serial (1 - φ) gets no speedup

• Overall performance is limited by the fraction of the work that cannot be done in parallel (1 - φ)

• Diminishing returns with increasing processors (n)

Speedup = SerialTime / ParallelTime = 1 / ((1 - φ) + φ/n)

where φ = fraction of the work that can be done in parallel, and n = number of processors.
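
A minimal sketch of the model in Python (the function name is mine):

```python
def amdahl_speedup(phi: float, n: int) -> float:
    """Amdahl's Law: speedup when the parallel fraction phi of a
    workload runs on n processors and the rest stays serial."""
    return 1.0 / ((1.0 - phi) + phi / n)

# Diminishing returns: with phi = 0.9 the speedup can never exceed 10,
# no matter how many processors are added.
for n in (2, 4, 8, 16, 32, 64):
    print(f"n = {n:2d}: speedup = {amdahl_speedup(0.9, n):.2f}")
```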

Page 16: Pipelined Processing

• Single Processor enhanced with discrete stages

• Instructions “flow” through pipeline stages

• Parallel Speedup with multiple instructions being executed (by parts) simultaneously

• Realized speedup is partly determined by the number of stages: 5 stages = at most 5 times faster

IF - Instruction Fetch
D - Instruction Decode
OF - Operand Fetch
EX - Execute
WB - Write Back or Result Store

Processor clock/cycle is divided into sub-cycles; each stage takes one sub-cycle.

Cycle:  1    2    3    4    5
Stage:  IF   D    OF   EX   WB
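
A toy trace of instructions flowing through the five stages (a sketch, not from the slides; assumes no stalls or flushes):

```python
STAGES = ["IF", "D", "OF", "EX", "WB"]

def trace(n_instructions: int) -> None:
    """Print which instruction occupies each pipeline stage per cycle."""
    total_cycles = len(STAGES) + n_instructions - 1
    for cycle in range(total_cycles):
        cells = []
        for s, stage in enumerate(STAGES):
            i = cycle - s  # instruction index currently in stage s
            cells.append(f"{stage}:I{i}" if 0 <= i < n_instructions else f"{stage}:--")
        print(f"cycle {cycle + 1}: " + "  ".join(cells))

trace(4)  # four instructions overlap; one completes per cycle once the pipe fills
```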

Page 17: Pipeline Performance

• Speedup is serial time (nS) over parallel time

• Performance is limited by pipeline flushes due to jumps; n is the average number of instructions between flushes

• speculative execution and branch prediction can minimize pipeline flushes

• Performance is also reduced by pipeline stalls (s), due to conflicts with bus access, data not ready delays, and other sources

Speedup = nS / (S + (n - 1) + n·s)

where
S = number of pipeline STAGES
n = average number of instructions between pipeline flushes
s = frequency of pipeline stalls (%)
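
A sketch of the reconstructed formula in Python (the function name is mine; the formula assumes the pipeline refills after every flush):

```python
def pipeline_speedup(S: int, n: float, s: float) -> float:
    """Speedup of an S-stage pipeline over serial execution: n
    instructions run between flushes, each stalling with frequency s."""
    serial_time = n * S                      # n instructions, S cycles each
    parallel_time = S + (n - 1) + n * s      # fill + overlap + stall cycles
    return serial_time / parallel_time

# Longer runs between flushes push speedup toward the stage count (5 here).
for n in (5, 20, 100, 1000):
    print(f"n = {n:4d}: speedup = {pipeline_speedup(5, n, 0.05):.2f}")
```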

Page 18: Super-Scalar: Multiple Pipelines

• Concurrent execution of multiple sets of instructions

• Example: simultaneous execution of instructions through an integer pipeline while processing instructions through a floating-point pipeline

• Compiler: identifies and specifies separate instruction sets for concurrent execution through different pipes

Page 19: Algorithm/Thread Parallelism

• Parallel “threads of execution”
  – could be a separate process, OR
  – could be a multi-threaded process

• Each thread of execution obeys Amdahl’s parallel speedup model

• Multiple concurrently executing processes result in multiple serial components executing concurrently - another level of parallelism

[Figure: two programs, P1 and P2, each scheduled S A A B B S, executing concurrently on the machine]

Observe that the serial parts of Program 1 and Program 2 are now running in parallel with each other. Each program would take 6 time units on a uniprocessor, or a total workload serial time of 12. Each has a speedup of 1.5. The total speedup is 12/4 = 3, which is also the sum of the program speedups.

Page 20: Multiprocess Speedup

• Concurrent execution of multiple unrelated processes

• Each process is limited by Amdahl’s parallel speedup

• Multiple concurrently executing processes result in multiple serial components executing concurrently - another level of parallelism

• Avoids Degree of Parallelism (DOP) speedup limitations

• Linear scaling up to machine limits of processors and memory: n × single-process speedup

[Figure: two processes, each scheduled S A A B B S, shown three ways:
No speedup, uniprocessor: both run serially, 12 t.
Single-process parallelism (one process at a time, each internally parallel): 8 t, Speedup = 1.5.
Multi-process parallelism (both run concurrently): 4 t, Speedup = 3.]

Page 21: Algorithm/Thread Analytical

Multi-Process/Thread Speedup (similar processes):

Speedup = N / ((1 - φ) + φ/n)

where φ = fraction of work that can be done in parallel, n = number of processors, and N = number of concurrent (assumed similar) processes or threads.

Multi-Process/Thread Speedup (dissimilar processes):

Speedup = Σ (i = 1..N) 1 / ((1 - φᵢ) + φᵢ/nᵢ)

where φᵢ = fraction of work of process i that can be done in parallel, n = number of processors in the system, nᵢ = number of processors used by process i, and N = number of concurrent (assumed dissimilar) processes or threads.
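
Both formulas in a short Python sketch (function names are mine). The two-program example from the previous slides, with φ = 2/3 on n = 2 processors per program, recovers the total speedup of 3:

```python
def similar_speedup(phi: float, n: int, N: int) -> float:
    """N similar concurrent processes, each with parallel fraction phi on n processors."""
    return N / ((1.0 - phi) + phi / n)

def dissimilar_speedup(procs: list[tuple[float, int]]) -> float:
    """Sum of per-process Amdahl speedups over (phi_i, n_i) pairs."""
    return sum(1.0 / ((1.0 - phi) + phi / n) for phi, n in procs)

print(f"{similar_speedup(2/3, 2, 2):.1f}")                 # 3.0
print(f"{dissimilar_speedup([(2/3, 2), (2/3, 2)]):.1f}")   # 3.0
```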

Page 22: Realizing Multiple Levels of Parallelism

• Most parallelism suffers from diminishing returns - resulting in limited scalability.

• Allocating hardware resources to capture multiple levels of parallelism - operate at efficient end of speedup curves.

• Manufacturers of microcontrollers are integrating multiple levels of parallelism on a single chip

[Figure: Efficiency (Speedup/N) versus Number of Processors (N = 2 to 64), one curve per parallel fraction (0.99, 0.95, 0.9, 0.8); efficiency falls as N grows, most steeply for the smaller parallel fractions]
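
The curves in that figure can be regenerated from Amdahl's Law (a sketch reusing the speedup formula from Page 15):

```python
def efficiency(phi: float, n: int) -> float:
    """Efficiency = Amdahl speedup divided by the processor count n."""
    speedup = 1.0 / ((1.0 - phi) + phi / n)
    return speedup / n

# One row per parallel fraction, one column per processor count.
for phi in (0.99, 0.95, 0.9, 0.8):
    row = "  ".join(f"{efficiency(phi, n):.2f}" for n in (2, 4, 8, 16, 32, 64))
    print(f"phi = {phi}: {row}")
```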

Page 23: End of Lecture

End of today’s lecture.

Page 24: Blank Slide