
Chapter 1: Perspectives

Copyright © 2005-2008 Yan Solihin

Copyright notice: No part of this publication may be reproduced, stored in a retrieval system, or transmitted by any means (electronic, mechanical, photocopying, recording, or otherwise) without the prior written permission of the author. An exception is granted for academic lectures at universities and colleges, provided that the following text is included in such copy: “Source: Yan Solihin, Fundamentals of Parallel Computer Architecture, 2008”.


Outline for Lecture 1

- Introduction
- Types of parallelism
- Architectural trends
- Why parallel computers?
- Scope of CSC/ECE 506


Evolution in Microprocessors

[Figure omitted from this transcript: the original slide shows a chart of microprocessor evolution.]


Key Points

- More and more components can be integrated on a single chip.
- Speed of integration tracks Moore's law, doubling every 18–24 months.
  - Exercise: Look up how the number of transistors per chip has changed, especially since 2006. Submit here.
- Until recently, performance tracked speed of integration.
- At the architectural level, two techniques facilitated this:
  - Instruction-level parallelism
  - Cache memory
- Performance gains from uniprocessor systems were high enough that multiprocessor systems were not viable for most uses.


Illustration

- A 100-processor system with perfect speedup, compared to a single-processor system:
  - Year 1: 100x faster
  - Year 2: 62.5x faster
  - Year 3: 39x faster
  - ...
  - Year 10: 0.9x faster
- Single-processor performance catches up in just a few years!
- Even worse:
  - It takes longer to develop a multiprocessor system.
  - Low volume means prices must be very high.
  - High prices delay adoption.
  - Perfect speedup is unattainable.
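The numbers above are consistent with uniprocessor performance improving by roughly 60–70% per year (the slide does not state the exact rate). A minimal sketch, assuming a constant 60%/year improvement, reproduces the trend approximately:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* Assumption: uniprocessor performance grows ~60%/year;
           the slide's exact figures imply a slightly higher rate. */
        const double growth = 1.6;
        for (int year = 1; year <= 10; year++) {
            /* The 100-processor system stays at 100x the year-1
               uniprocessor, while the uniprocessor compounds yearly. */
            double relative = 100.0 / pow(growth, year - 1);
            printf("Year %2d: %.1fx faster\n", year, relative);
        }
        return 0;   /* compile with -lm */
    }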


Why did uniprocessor performance grow so fast?

- ≈ half from circuit improvements (smaller transistors, faster clocks, etc.)
- ≈ half from architecture/organization:
  - Instruction-level parallelism (ILP)
    - Pipelining: RISC, CISC with a RISC back-end
    - Superscalar
    - Out-of-order execution
  - Memory hierarchy (caches)
    - Exploit spatial and temporal locality (see the sketch after this list)
    - Multiple cache levels
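To make spatial and temporal locality concrete, here is a minimal illustration (my addition, not from the slides):

    #include <stdio.h>

    #define N 1024

    static double a[N][N];

    int main(void) {
        double sum = 0.0;            /* reused every iteration: temporal locality */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];      /* row-major traversal touches adjacent
                                        addresses: spatial locality */
        /* Swapping the loop order (j outer, i inner) strides N*sizeof(double)
           bytes per access and typically incurs far more cache misses. */
        printf("%f\n", sum);
        return 0;
    }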


But uniprocessor performance growth is stalling

- Source of uniprocessor performance growth: ILP
  - Parallel execution of independent instructions from a single thread
- ILP growth has slowed abruptly:
  - Memory wall: processor speed grows at 55%/year, memory speed grows at 7%/year
  - ILP wall: achieving higher ILP requires quadratically increasing complexity (and power)
  - Power efficiency: thermal packaging limit vs. cost


Types of parallelism

- Instruction level (cf. ECE 521)
  - Pipelining:

        A (a load)   IF  ID  EX  MEM  WB
        B                IF  ID  EX   MEM  WB
        C                    IF  ID   EX   MEM  WB
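As a back-of-the-envelope check (my sketch, not from the slides): with k pipeline stages and n instructions, and no stalls, a pipeline finishes in about k + (n - 1) cycles instead of k * n:

    #include <stdio.h>

    int main(void) {
        const int k = 5;    /* IF, ID, EX, MEM, WB */
        const int n = 100;  /* instructions; assumes no stalls or hazards */
        printf("unpipelined: %d cycles\n", k * n);        /* 500 */
        printf("pipelined:   %d cycles\n", k + (n - 1));  /* 104 */
        return 0;
    }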


Types of parallelism, cont.

- Superscalar / VLIW
- Original:

      LD   F0, 34(R2)
      ADDD F4, F0, F2
      LD   F7, 45(R3)
      ADDD F8, F7, F6

- Schedule as:

      LD   F0, 34(R2)  |  LD   F7, 45(R3)
      ADDD F4, F0, F2  |  ADDD F8, F7, F6

+ Moderate degree of parallelism (sometimes 50)
– Requires fast communication (register level)


Why ILP is slowing

- Branch-prediction accuracy is already > 90%
  - Hard to improve it even more
- Pipelines are already deep (≈ 20–30 stages)
  - But critical dependence loops do not change
  - Memory latency requires more clock cycles to satisfy
- Processor width is already high
  - Quadratically increasing complexity to increase the width further
- Cache size
  - Effective, but also shows diminishing returns
  - In general, the size must be doubled to reduce the miss rate by half


Current trends: multicore and manycore

Aspect           | Intel Clovertown | AMD Barcelona                          | IBM Cell
-----------------|------------------|----------------------------------------|-------------------
# cores          | 4                | 4                                      | 8+1
Clock frequency  | 2.66 GHz         | 2.3 GHz                                | 3.2 GHz
Core type        | OOO superscalar  | OOO superscalar                        | 2-issue SIMD
Caches           | 2x 4 MB L2       | 512 KB L2 (private), 2 MB L3 (shared)  | 256 KB local store
Chip power       | 120 W            | 95 W                                   | 100 W

Exercise: Browse the Web for information on more recent processors, and for each processor, fill out this form. (You can view the submissions here.)


Historical perspectives

- 80s – early 90s: prime time for parallel architecture research
  - A microprocessor could not fit on a chip, so multiple chips (and processors) were the natural design
  - J-machine, M-machine, Alewife, Tera, HEP, DASH, etc.
  - Exercise: Pick one of these machines, and identify a major innovation that it introduced. Submit here.
- 90s: at the low end, uniprocessor systems' speed grows much faster than parallel systems' speed
  - A microprocessor fits on a chip. So do a branch predictor, multiple functional units, large caches, etc.!
  - The microprocessor also exploits parallelism (pipelining, multiple issue, VLIW) – parallelism originally invented for multiprocessors
  - Many parallel computer vendors went bankrupt
  - Prestigious but small high-performance computing market


“If the automobile industry advanced as rapidly as the semiconductor industry, a Rolls Royce would get a half million miles per gallon and it would be cheaper to throw it away than to park it.”

– Gordon Moore, Intel Corporation, 1998


Historical perspectives, cont.

- 90s: emergence of distributed (vs. parallel) machines
  - Progress in network technologies:
    - Network bandwidth grows faster than Moore's law
    - Fast interconnection networks are getting cheap
  - Connects cheap uniprocessor systems into a large distributed machine
  - Networks of Workstations, Clusters, Grid
- 00s: parallel architectures are back!
  - Transistors per chip >> transistors needed by a single microprocessor
  - Harder to get more performance from a uniprocessor
  - SMT (simultaneous multithreading), CMP (chip multiprocessor), ultimately massive CMP
  - E.g., Intel Pentium D, Core Duo, AMD Dual Core, IBM Power5, Sun Niagara


Parallel computers

A parallel computer is a collection of processing elements that can communicate and cooperate to solve a large problem fast. [Almasi & Gottlieb]

- "collection of processing elements"
  - How many? How powerful is each? Scalability?
  - A few very powerful processors (e.g., Altix) vs. many small ones (BlueGene)
- "that can communicate"
  - How do PEs communicate? (shared memory vs. message passing)
  - Interconnection network (bus, multistage, crossbar, ...)
  - Metrics: cost, latency, throughput, scalability, fault tolerance


Parallel computers, cont.

- "and cooperate"
  - Issues: granularity, synchronization, and autonomy
  - Synchronization allows sequencing of operations to ensure correctness
  - As granularity goes up, parallelism goes down, but communication and overhead go down as well:
    - Statement/instruction level: 2–10 instructions (ECE 521)
    - Loop level: 10–1K instructions
    - Task level: 1K–1M instructions
    - Program level: > 1M instructions
  - Autonomy: SIMD (single instruction stream) vs. MIMD (multiple instruction streams)


Parallel computers, cont.

- "solve a large problem fast"
  - General- vs. special-purpose machine?
  - Any machine can solve certain problems well
- What domains?
  - Highly (embarrassingly) parallel apps: many scientific codes
  - Medium-parallel apps: many engineering apps (finite element, VLSI CAD)
  - Non-parallel apps: compilers, editors (do we care?)


Why parallel computers?

- Absolute performance: can we afford to wait?
  - Folding of a single protein takes years to simulate on the most advanced microprocessor; it takes only days on a parallel computer
  - Weather forecasting: timeliness is crucial
- Cost/performance
  - Harder to improve performance on a single processor
  - Bigger monolithic processor vs. many simple processors
- Power/performance
- Reliability and availability
- Key enabling technologies
  - Advances in microprocessor and interconnect technology
  - Advances in software technology


Scope of CSC/ECE 506

- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - SIMD (vector architecture)
  - MIMD
    - Shared-memory machines (SMP and DSM)
    - Clusters
- Programming models:
  - Shared memory
  - Message passing
  - Hybrid


Loop-level parallelism

- Sometimes each iteration can be performed independently:

      for (i = 0; i < 8; i++)
          a[i] = b[i] + c[i];

- Sometimes iterations cannot be performed independently => no loop-level parallelism:

      for (i = 1; i < 8; i++)
          a[i] = b[i] + a[i-1];   /* loop-carried dependence on a[i-1] */

+ Very high parallelism (> 1K)
+ Often easy to achieve load balance
– Some loops are not parallel
– Some apps do not have many loops
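The independent loop above is exactly what shared-memory directives exploit. A minimal sketch using OpenMP (my addition, not from the slides):

    #include <stdio.h>

    int main(void) {
        double a[8], b[8], c[8];
        for (int i = 0; i < 8; i++) { b[i] = i; c[i] = 2 * i; }

        /* Each iteration is independent, so the runtime may split
           iterations across threads in any order. */
        #pragma omp parallel for
        for (int i = 0; i < 8; i++)
            a[i] = b[i] + c[i];

        for (int i = 0; i < 8; i++) printf("%.0f ", a[i]);
        printf("\n");
        return 0;   /* compile with, e.g., gcc -fopenmp */
    }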


Task-level parallelism

- Arbitrary code segments in a single program
- Across loops:

      ...
      for (i = 0; i < n; i++) sum = sum + a[i];
      for (i = 0; i < n; i++) prod = prod * a[i];
      ...

- Subroutines:

      Cost = getCost();
      A = computeSum();
      B = A + Cost;

- Threads: e.g., an editor: GUI, printing, parsing

+ Larger granularity => low overheads, little communication
– Low degree of parallelism
– Hard to balance
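The two loops above write disjoint results, so they can run as independent tasks. A minimal sketch with POSIX threads (my addition, not from the slides):

    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    static double a[N];
    static double sum = 0.0, prod = 1.0;

    static void *sum_task(void *arg) {
        for (int i = 0; i < N; i++) sum += a[i];
        return NULL;
    }

    static void *prod_task(void *arg) {
        for (int i = 0; i < N; i++) prod *= a[i];
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        for (int i = 0; i < N; i++) a[i] = i + 1;

        /* The reductions write different variables, so no
           synchronization is needed between the two tasks. */
        pthread_create(&t1, NULL, sum_task, NULL);
        pthread_create(&t2, NULL, prod_task, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        printf("sum = %.0f, prod = %.0f\n", sum, prod);
        return 0;   /* compile with -pthread */
    }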


Program-level parallelism

- Various independent programs execute together
- gmake:

      gcc -c code1.c   // assign to proc1
      gcc -c code2.c   // assign to proc2
      gcc -c main.c    // assign to proc3
      gcc main.o code1.o code2.o
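(In practice this is what gmake's -j flag exploits: gmake -j3 runs the three independent compilations concurrently, while the final link still waits for all of them.)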

+ No communication
– Hard to balance
– Few opportunities


Scope of CSC/ECE 506

- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - SIMD (vector architecture)
  - MIMD
    - *Shared-memory machines (SMP and DSM)
    - Clusters
- Programming models:
  - Shared memory
  - Message passing
  - Hybrid


Taxonomy of parallel computers

The Flynn taxonomy classifies machines by:
- single or multiple instruction streams, and
- single or multiple data streams.

1. SISD machine (most desktops and laptops)
- Only one instruction fetch stream
- Most of yesterday's workstations or desktops

[Diagram: one control unit sends a single instruction stream to a single ALU, which processes a single data stream.]


SIMD

- Examples: vector processors, SIMD extensions (MMX)
- A single instruction operates on multiple data items.

[Diagram: one control unit broadcasts a single instruction stream to ALU 1 through ALU n, each of which processes its own data stream.]

SISD:

      for (i = 0; i < 8; i++)
          a[i] = b[i] + c[i];

SIMD:

      a = b + c;   // vector addition
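On x86 the same idea appears as SIMD extensions (MMX and its successors). A minimal sketch using SSE intrinsics (my addition, not from the slides), where one instruction adds four floats at a time:

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void) {
        float b[8] = {0, 1, 2, 3, 4, 5, 6, 7};
        float c[8] = {0, 2, 4, 6, 8, 10, 12, 14};
        float a[8];

        /* Each _mm_add_ps adds four floats with one SIMD instruction. */
        for (int i = 0; i < 8; i += 4) {
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_loadu_ps(&c[i]);
            _mm_storeu_ps(&a[i], _mm_add_ps(vb, vc));
        }

        for (int i = 0; i < 8; i++) printf("%.0f ", a[i]);
        printf("\n");
        return 0;
    }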


MISD

- Example: CMU Warp
- Systolic arrays

[Diagram: control units 1 through n each issue their own instruction stream to ALU 1 through ALU n, which all operate on a single data stream passed from one ALU to the next.]


MIMD

- Independent processors connected together to form a multiprocessor system.
- Physical organization:
  - Determines which memory hierarchy level is shared
- Programming abstraction:
  - Shared memory:
    - On a chip: chip multiprocessor (CMP)
    - Interconnected by a bus: symmetric multiprocessors (SMP)
    - Point-to-point interconnection: distributed shared memory (DSM)
  - Distributed memory:
    - Clusters, Grid


MIMD Physical Organization

[Diagram: two processors P share one set of caches and one memory M.]

Shared-cache architecture:
- CMP (or simultaneous multithreading)
- e.g., Pentium 4 chip, IBM Power4 chip, Sun Niagara, Pentium D, etc.
- Implies shared-memory hardware

[Diagram: processors P, each with private caches, connected by a network to a shared memory M.]

UMA (Uniform Memory Access) shared memory:
- Pentium Pro Quad, Sun Enterprise, etc.
- What interconnection network?
  - Bus
  - Multistage
  - Crossbar
  - etc.
- Implies shared-memory hardware


MIMD Physical Organization (2)

[Diagram: nodes, each with a processor P, caches, and a local memory M, connected by a network.]

NUMA (Non-Uniform Memory Access) shared memory:
- SGI Origin, Altix, IBM p690, AMD Hammer-based systems
- What interconnection network?
  - Crossbar
  - Mesh
  - Hypercube
  - etc.
- Also referred to as distributed shared memory


MIMD Physical Organization (3)

[Diagram: nodes, each with a processor P, caches, memory M, and I/O, connected by a network.]

Distributed system/memory:
- Also called clusters, grid
- Don't confuse it with distributed shared memory


Scope of CSC/ECE 506

- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - MIMD
    - Shared-memory machines (SMP and DSM)
- Programming models:
  - Shared memory
  - Message passing
  - Hybrid (e.g., UPC)
  - Data parallel


Programming models: shared memory

- Shared memory / shared address space:
  - Each processor can see the entire memory
- Programming model = thread programming, as in uniprocessor systems

[Diagram: processors P all connected to a single shared memory M.]
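A minimal sketch of the model (my addition, not from the slides): threads communicate simply by reading and writing the same variables, with a lock to keep updates correct:

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;   /* visible to every thread: shared memory */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);     /* synchronization for correctness */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);   /* 400000 */
        return 0;   /* compile with -pthread */
    }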


Programming models: message-passing

- Distributed memory / message passing / multiple address spaces:
  - A processor can only directly access its own local memory.
  - All communication happens by explicit messages.

[Diagram: nodes, each a processor P with its own private memory M, exchanging messages over a network.]
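The canonical realization of this model is MPI. A minimal sketch (my addition, not from the slides) in which rank 0's data can reach rank 1 only through an explicit message:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int value = 42;   /* lives only in rank 0's address space */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int value;
            /* The only way to see rank 0's data is an explicit message. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;   /* run with, e.g., mpirun -np 2 ./a.out */
    }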


Programming models: data parallel

- Programming model:
  - Operations are performed in parallel on each element of a data structure
  - Logically a single thread of control performs sequential or parallel steps
  - Conceptually, a processor is associated with each data element
- Architectural model:
  - Array of many simple, cheap processing elements (PEs), each with little memory
  - Processing elements don't sequence through instructions
  - PEs are attached to a control processor that issues instructions
  - Specialized and general communication, cheap global synchronization
- Original motivation:
  - Matches simple differential equation solvers
  - Centralizes the high cost of instruction fetch/sequencing


Top 500 supercomputers

- http://www.top500.org
- Let's look at the Earth Simulator
  - Was #1 in 2004, now #10 in 2006
- Hardware:
  - 5,120 NEC CPUs at 500 MHz (640 nodes, 8 CPUs per node)
  - 8 GFLOPS per CPU (41 TFLOPS total)
  - Sustained performance in the 30s of TFLOPS!
  - 2 GB per CPU (4 x 512 MB FPLRAM modules); 10 TB total memory
  - Shared memory inside each node
  - 640 x 640 crossbar switch between the nodes
  - 16 GB/s inter-node bandwidth
  - 20 kVA power consumption per node


Exercise

Go to http://www.top500.org and look at the Statistics menu near the right-hand side. Click on one of the statistics, e.g., Vendors or Processor Architecture, and examine what kinds of systems are prevalent. Then do the same for earlier lists, and report on the trend. You may find interesting results by clicking on the "Development" tab.

For example, if you choose “Processor Architecture,” the current list, http://www.top500.org/stats/list/34/procarch/, will tell you how many vector and scalar architectures there are. Change the “34” to “32” to get last year’s list. Change it to a lower number to get an earlier year’s list. You can go all the way back to the first list from 1993.

Submit your results here.