Topic 1: Introduction

Source: studies.ac.upc.edu/ETSETB/SEGPAR/slides/tema1.pdf

Eduard Ayguadé i Josep Llosa

These slides have been prepared using some of the teaching material of Prof. Mateo Valero at the Computer Architecture Department and Barcelona Supercomputing Center. Some processor diagrams have been extracted from the “Microprocessor Report journal, Copyright In/Stat&MDR.” Other material available through the internet has also been used to prepare this chapter’s slides.

Why do we need high-performance computing?

Computational Needs of Technical, Scientific, Digital Media and Business Applications

CFD Wing Simulation
• 512x64x256 grid (8.3 x 10^6 mesh points)
• 5000 FLOPs per mesh point, 5000 time steps/cycles
• 2.15 x 10^14 FLOPs

CFD Full Plane Simulation
• 512x64x256 grid (3.5 x 10^17 mesh points)
• 5000 FLOPs per mesh point, 5000 time steps/cycles
• 8.7 x 10^24 FLOPs

Source: A. Jameson, et al.

Materials Science

Magnetic Material:
• Current: 2000 atoms; 2.64 TF/s, 512 GB
• Future: HDD simulation; 30 TF/s, 2 TB

Electronic Structures:
• Current: 300 atoms; 0.5 TF/s, 100 GB
• Future: 3000 atoms; 50 TF/s, 2 TB

Source: D. Bailey, NERSC

Digital Movies and Special Effects

• ~1e14 FLOPs per frame
• 50 frames/sec
• 90 minute movie: 2.7e19 FLOPs

Source: Pixar

Spare Parts Inventory Planning

Modelling the optimized deployment of 10000 part numbers across 100 part depots requires:
• 2 x 10^14 FLOP/s (12 hours on ten 650 MHz CPUs)
• 2.4 PetaFlop/s sustained performance (1 hour turn-around time)

The industry trend toward rapid, frequent modeling for timely business decision support drives higher sustained performance.

Source: B. Dietrich, IBM

Operations and execution time

[Figure: chart relating number of operations, operations per second (1 Mflop/s, 1 Gflop/s, 1 Tflop/s, 1 Pflop/s) and time. Marked workloads: CFD wing simulation (10^14 operations), digital movies (10^19) and CFD full plane (10^24). Labelled execution times: 1.5 minutes, 3 hours, 1.2 days, 4 months, 40 months, 32 years, 320 years and 32000 years.]
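The times on this chart follow from dividing an operation count by a sustained rate. A minimal Python sketch of that arithmetic is given here; the operation counts are the rounded values marked on the chart, and the helper names (`execution_time_seconds`, `humanize`) are illustrative, not part of the original slides.

```python
# Sketch: execution time = number of operations / sustained rate.
# Operation counts are the rounded chart values; helper names are illustrative.

WORKLOADS = {
    "CFD wing simulation": 1e14,
    "Digital movies": 1e19,
    "CFD full plane": 1e24,
}

RATES = {
    "1 Mflop/s": 1e6,
    "1 Gflop/s": 1e9,
    "1 Tflop/s": 1e12,
    "1 Pflop/s": 1e15,
}

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def execution_time_seconds(operations, rate):
    """Time in seconds to run `operations` at `rate` operations per second."""
    return operations / rate


def humanize(seconds):
    """Render a duration in the largest convenient unit."""
    if seconds < 3600:
        return f"{seconds / 60:.1f} minutes"
    if seconds < 86400:
        return f"{seconds / 3600:.1f} hours"
    if seconds < SECONDS_PER_YEAR:
        return f"{seconds / 86400:.1f} days"
    return f"{seconds / SECONDS_PER_YEAR:.0f} years"


if __name__ == "__main__":
    for name, ops in WORKLOADS.items():
        for label, rate in RATES.items():
            seconds = execution_time_seconds(ops, rate)
            print(f"{name} at {label}: {humanize(seconds)}")
```

The printed values are computed exactly, so they can differ slightly from the rounded labels on the slide.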

Current high-end microprocessors

PowerPC 970FX
• Frequency: 2.2 GHz
• Peak: 8.8 GFlops

Current multi-core architectures

• IBM’s triple-core processor for the Xbox 360 videogame console
• 8 multithreading processors on a 333 million transistor chip for a Raza Microelectronics network processor
• Intel Core Duo processor with two cores, a unified 2 MB L2 cache, and 152 million transistors
• Intel Pentium D (dual core) and eXtreme Edition (dual core, multithreaded), 2 x 1 MB L2 cache, up to 3.2 GHz
• Sun Niagara 1 (UltraSPARC T1) package and block diagram, running at 1.0 and 1.2 GHz, packaging 4, 6 and 8 active cores


Cell processor architecture with 1 PowerPC (PPU) core and 8 Synergistic Processing Units (SPU, 32 GFlops each), 256 GFlops


IBM Power5: dual-core SMT processor

• 8-way superscalar SMT cores
• 276M transistors, 389 mm2 die
• Operating in lab at 1.8 GHz and 1.3 V
• 1.9 MB L2 cache (point of coherency)
• On-chip L3 directory and memory controller
• Technology: 130 nm lithography, SOI, Cu wiring, 8 layers of metal
• High-speed elastic bus interface
• I/Os: 2313 signal, 3057 power


IBM Power5 multi-chip module

• 95mm x 95mm
• Four POWER5 chips, 2 processors per chip, 2-way simultaneous multithreaded
• Four L3 cache chips
• 4,491 signal I/Os
• 89 layers of metal

[Figure: module layout with four POWER5 chips and four L3 cache chips; labels: Memory, I/O, JTAG, On-Book, Off-Book.]

Current high-performance µP (8/06)

Operations and execution time

[Figure: chart relating number of operations, operations per second and time for a rate of 256 Gflop/s. Marked workloads: CFD wing simulation (10^14 operations), digital movies (10^19) and CFD full plane (10^24). Labelled execution times: 4 minutes, 4 months and 125000 years.]

www.top500.org (November 2005)

[Figure: TOP500 performance chart; axis label: GFlops.]

MareNostrum: 4564 CPUs, 40 TFlops


JS20 Processor Blade
• 2-way PPC970FX symmetric multiprocessor (SMP)
• 4 GB memory, shared memory
• 40 GBytes local IDE disk
• Myrinet network adapter and 2 Gigabit ports

MareNostrum 1.5

PowerPC 970MP

Blade Center
• 14 blades per chassis (7U)
• 28 processors
• 56 GB memory
• Gigabit Ethernet switch

6 chassis in a rack (42U)
• 168 processors (1.4 TFlops)
• 336 GB memory

Myrinet Clos 256+256 switch

Clos 256x256

• 256 links (1 to each node), 250 MB/s in each direction

[Figure: Clos network diagram; labels: Spine 1280 (x2), 128 links.]

Blade centers

Myrinet racks

Storage servers

Operations rack

Gigabit switch

10/100 switches


NASA Columbia Supercomputer

• 4 Itanium2 per C-Brick
• Global shared memory across 4 CPUs, 8 Gigabytes
• Global shared memory across 64 CPUs, 128 Gigabytes
• Global shared memory across 512 CPUs, 1 Terabyte
• 20 SGI® Altix™ 3700 superclusters, each with 512 Itanium2 processors (1.5 GHz, 6 MB cache)
• Infiniband network to connect superclusters

35 years of microprocessor history

4004, 8008, 8080, 8085, 8086, 80286, 80386, 80486, Pentium, Pentium II, Pentium III, Pentium 4, Pentium D

Integration scale has increased
• Feature size: 10 microns in 1971 to 0.10 microns in 2005

Technology: SIA roadmap

Year                  1999   2002   2005   2008   2011   2014
Feature size (nm)      180    130    100     70     50     35
Logic trans/cm2       6.2M    18M    39M    84M   180M   390M
Cost/trans (mc)      1.735  0.580  0.255  0.110  0.049  0.022
#pads/chip            1867   2553   3492   4776   6532   8935
Clock (MHz)           1250   2100   3500   6000  10000  16900
Chip size (mm2)        340    430    530    620    750    900
Wiring levels          6-7      7    7-8    8-9      9     10
Power supply (V)       1.8    1.5    1.2    0.9    0.6    0.5
High-perf pow (W)       90    130    160    170    175    183

35 years of microprocessor history

Frequency increased 50X
• 13X due to process technology
• Additional 4X due to microarchitecture

[Figure: frequency (MHz, log scale from 10 to 10,000) versus process generation (1.0µ, 0.7µ, 0.5µ, 0.35µ, 0.25µ, 0.18µ) for the i486, Pentium®, Pentium® II and III, and Pentium® 4 processors; series: Freq (uArch) and Freq (Process).]

Performance increased >75X
• 13X due to frequency
• Additional >6X due to microarchitecture and design

[Figure: relative performance and relative frequency (log scale from 1 to 100) versus the same process generations and processors.]

*Note: performance measured using SpecINT and SpecFP

Basic concepts

Processor: Control Unit and Register File
Memory: Instructions + Data

Instruction types:
• Load/Store: load Rx := M[], store M[] := Rx
• Operation: Ri := Rj op Rk
• Control: Branch (cond.)

Execution of a program

• N: number of instructions
  • Architecture: CISC, RISC, vector
  • Compiler
• CPI: cycles to execute one instruction
  • Architecture: CISC, RISC, pipelined, superscalar, vector
  • Computer structure
• tc: processor cycle time
  • Computer structure
  • Technology

T = N * CPI * tc
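The relation T = N * CPI * tc can be checked with a few lines of Python. The sketch below uses made-up values (they are assumptions for illustration, not measurements from the slides) to show that fewer instructions do not help if CPI grows proportionally.

```python
# Sketch of T = N * CPI * tc. All numeric values below are hypothetical.

def execution_time(n_instructions, cpi, cycle_time_s):
    """Execution time in seconds: instructions x cycles/instruction x seconds/cycle."""
    return n_instructions * cpi * cycle_time_s


CYCLE_TIME = 1e-9  # assumed 1 ns cycle (1 GHz clock)

# Hypothetical program compiled two ways:
simple_isa = execution_time(1_000_000, 2.0, CYCLE_TIME)   # more, simpler instructions
complex_isa = execution_time(800_000, 2.5, CYCLE_TIME)    # fewer, slower instructions

if __name__ == "__main__":
    print(f"simple ISA:  {simple_isa * 1e3:.3f} ms")
    print(f"complex ISA: {complex_isa * 1e3:.3f} ms")
```

Both variants take the same time here because the product N * CPI is identical; only changing that product or tc changes T.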

Pipelined processors

• Von Neumann (IPC = 1/5)
• Pipelined (IPC <= 1)

[Figure: timing diagrams of five-stage instructions (F D R E W) against time. Von Neumann: instructions i-1, i and i+1 execute one after another. Pipelined: consecutive instructions overlap, each starting one cycle after the previous one.]

T = N * (1 / IPC) * tc
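The two IPC figures for the five-stage F D R E W pipeline can be reproduced by counting cycles. The following Python sketch assumes an ideal pipeline with no stalls or hazards (an assumption; real pipelines fall below IPC = 1), and the function names are illustrative.

```python
# Sketch: cycle counts for a 5-stage machine (F D R E W), ideal case, no stalls.

STAGES = 5


def cycles_sequential(n):
    """Von Neumann style: each instruction runs all stages before the next starts."""
    return n * STAGES


def cycles_pipelined(n):
    """Ideal pipeline: first instruction takes STAGES cycles, each later one adds 1."""
    return STAGES + (n - 1) if n > 0 else 0


def ipc(n, cycles):
    """Instructions completed per cycle."""
    return n / cycles


if __name__ == "__main__":
    n = 1000
    print("sequential IPC:", ipc(n, cycles_sequential(n)))
    print("pipelined IPC: ", ipc(n, cycles_pipelined(n)))
```

As N grows, the pipelined IPC approaches 1 but never reaches it, which matches the bound IPC <= 1.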

Superscalar processors

Out of order (IPC <= 3)

[Figure: timing diagram of an out-of-order superscalar pipeline (stages F D R E W) against time, with several instructions in flight per cycle and a varying number of execute (E) cycles per instruction.]

T = N * (1 / IPC) * tc

Processor - DRAM gap

Processor - memory performance gap

Size / Speed / Cost trade-offs
• Static memories: small / fast / expensive
• Dynamic memories: large / slow / cheap
• Disks: huge / very slow / very cheap

[Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000) for CPU and DRAM; µProc improves 60%/yr., DRAM 7%/yr.]
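The widening of the gap follows from compounding the two growth rates quoted on the chart (60%/yr. for the processor, 7%/yr. for DRAM). A small Python sketch, assuming both rates stay constant over the whole period (a simplification), is:

```python
# Sketch: relative processor/DRAM performance gap under constant yearly growth.
# The 60% and 7% rates come from the chart; constant compounding is an assumption.

def relative_gap(years, cpu_growth=0.60, dram_growth=0.07):
    """Ratio of processor to DRAM performance after `years`, starting from 1.0."""
    return ((1 + cpu_growth) / (1 + dram_growth)) ** years


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(f"after {years:2d} years the gap is {relative_gap(years):.1f}x")
```

The ratio grows by roughly 50% per year, which is why a memory hierarchy is needed to hide DRAM latency.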

Solution: memory hierarchy

Earth Simulator (Japan)

• 640 nodes, each node with 8 NEC SX vector processors (8 Gflop/s peak per processor), 2 ns tc
• Total of 5120 processors, 40 TFlop/s peak

Vector processor

Processor: Control Unit, Scalar Registers and Vector Registers
Main Memory: Instructions (scalar + vector) + Data (scalar data and vector data)

Instructions:
• Ri := Rj op Rk
• VR[i] := VR[j] op VR[k]
• Branch (cond.)

Evaluating computer system performance

Which metrics can be used to evaluate computer system performance?

• Execution time
  • Also known as wall clock time, elapsed time, response time
  • Total time to complete a task
  • Example: hit RETURN, how long until the answer appears on the screen
• Throughput
  • Also known as bandwidth
  • Total number of operations (such as instructions, memory requests, programs) completed per unit time (rate)

Performance improves when execution time is reduced or throughput is increased

CPU time breakdown

CPU_Time = cycles x tc
• cycles: total cycles to execute the program
• tc: clock cycle time (clock period), 1/frequency

cycles = INST x CPI / W
• INST: total number of instructions executed
• CPI: average number of clock cycles executed per instruction
  • Total clock cycles divided by total instructions executed (CYCLES/INST)
  • Different instruction types (add, divide, etc.) may take different numbers of clock cycles to execute
• W: number of instructions executed per cycle

CPU_Time = Σ (INSTi x CPIi) x W^-1 x tc
• INSTi: number of instructions of type i executed
• CPIi: number of clock cycles to execute an instruction of type i

Is it a good metric?
• INST is influenced by the compiler and its optimizations
• The whole system is evaluated (CPU, memory, I/O)
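The per-type formula CPU_Time = Σ (INSTi x CPIi) x W^-1 x tc can be evaluated directly from an instruction mix. The Python sketch below uses a hypothetical mix; the counts, CPI values and clock period are assumptions chosen only to make the arithmetic easy to follow.

```python
# Sketch of CPU_Time = sum(INST_i * CPI_i) / W * tc with a hypothetical instruction mix.

def cpu_time(mix, w, tc):
    """mix maps instruction type -> (INST_i, CPI_i); w is instructions per cycle; tc in seconds."""
    cycles = sum(inst * cpi for inst, cpi in mix.values()) / w
    return cycles * tc


def average_cpi(mix):
    """Total clock cycles divided by total instructions executed."""
    total_cycles = sum(inst * cpi for inst, cpi in mix.values())
    total_inst = sum(inst for inst, _ in mix.values())
    return total_cycles / total_inst


# Hypothetical mix: (instruction count, cycles per instruction)
MIX = {
    "load/store": (300, 2),
    "operation": (500, 1),
    "control": (200, 3),
}

if __name__ == "__main__":
    t = cpu_time(MIX, w=1, tc=10e-9)  # assumed 100 MHz clock, W = 1
    print(f"CPU time: {t * 1e6:.1f} us, average CPI: {average_cpi(MIX):.2f}")
```

With this mix the program needs 1700 cycles for 1000 instructions, so the average CPI is the weighted mean of the per-type values.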

The INST-CPI tradeoff

In creating a machine code equivalent to a HLL construct, the compiler writer may have several choices that differ in INST and CPI

Best solution may involve more, simpler instructions

Example: multiply by constant 5:

    muli $2, $4, 5      4 cycles

or

    sll $2, $4, 2       1 cycle
    add $2, $2, $4      1 cycle

See Exercise 3
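The equivalence of the two sequences can be sketched in Python: shifting left by 2 multiplies by 4, and adding the original value gives 5x. The cycle counts are the ones quoted on the slide; the function names are illustrative.

```python
# Sketch: multiply by 5 as one multiply versus shift + add.
# Cycle counts are those quoted on the slide (4 for muli, 1 each for sll and add).

def times5_muli(x):
    """One 'muli' instruction: returns (result, cycles)."""
    return x * 5, 4


def times5_shift_add(x):
    """'sll' then 'add': (x << 2) + x == 4x + x == 5x. Returns (result, cycles)."""
    shifted = x << 2          # sll $2, $4, 2   (1 cycle)
    return shifted + x, 2     # add $2, $2, $4  (1 cycle)


if __name__ == "__main__":
    for x in (0, 1, 7, -3):
        print(x, times5_muli(x), times5_shift_add(x))
```

The second version executes two instructions instead of one (higher INST) but finishes in half the cycles, which is the tradeoff the slide describes.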

CPU metrics

tc is not a good metric to compare processors:
• Example: Pentium and Pentium III, both running at 500 MHz

CPI is not a good metric to compare processors:
• Intel Pentium: 5 cycles integer data path
• Intel Pentium Pro (basis for Pentium II and III): 14 cycles integer data path
• Pentium IV: 20 cycles integer data path

What’s behind high CPI values?

Can both be combined in a single metric?

CPU metrics

MIPS: Millions of Instructions Per Second:
    MIPS = INST x 10^-6 / CPU_Time = (CPI / W x tc)^-1 (tc in µs)

MFLOPS: Millions of FLOating Point operations per Second:
    MFLOPS = INST (only FP) x 10^-6 / CPU_Time

Peak metrics:
    MIPSpeak = IPC x W x f (f in MHz)
    MFLOPSpeak = IPC (only FP) x W (only FP) x f
• IPC: maximum number of instructions completed per cycle

CPU metrics: examples

f = 200 MHz, 1 integer FU, 1 floating-point FU
• W=1 → 200 MIPS, 200 MFLOPS
• W=2 → 400 MIPS, 200 MFLOPS

Consider the following program:

    real a[64], b[64], c[64]; /* 8 bytes each */
    integer i;
    …
    for (i=1; i<=64; i++)
        c[i] = a[i] + b[i];
    …

which is executed on a machine with W=1, f=100MHz. Compute CPI, MIPS and MFLOPS if CPU_Time=8µs.
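One way to set up this exercise is sketched below in Python. The slide does not give the number of instructions executed, so the value INST = 400 used here is purely a hypothetical placeholder; the 64 floating-point additions, the 8 µs CPU time, f = 100 MHz and W = 1 come from the exercise statement.

```python
# Sketch: deriving CPI, MIPS and MFLOPS from CPU time.
# INST = 400 is a HYPOTHETICAL instruction count; the slide does not state it.

def cpu_metrics(inst, fp_ops, cpu_time_us, f_mhz, w=1):
    """Return (CPI, MIPS, MFLOPS) given counts, CPU time in microseconds and clock in MHz."""
    cycles = cpu_time_us * f_mhz      # microseconds x MHz = clock cycles
    cpi = cycles * w / inst           # from cycles = INST x CPI / W
    mips = inst / cpu_time_us         # instructions per microsecond = MIPS
    mflops = fp_ops / cpu_time_us     # FP operations per microsecond = MFLOPS
    return cpi, mips, mflops


if __name__ == "__main__":
    cpi, mips, mflops = cpu_metrics(inst=400, fp_ops=64, cpu_time_us=8, f_mhz=100)
    print(f"CPI = {cpi}, MIPS = {mips}, MFLOPS = {mflops}")
```

Only the MFLOPS figure is independent of the assumed instruction count, since the loop performs exactly 64 floating-point additions.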

Evaluating computer system performance

A workload is a collection of programs

Ideally, a user needs to evaluate the performance of their workload on a given machine before deciding whether to purchase it

This is impractical because:
• There are too many unique workloads to evaluate
• Programs may be compiled for a different system, and the source code may not be available or easily portable (able to run on a different machine)

So how can we as customers make purchase decisions without being able to run our programs on different machines?

Benchmarks

Benchmarks are programs specifically chosen to measure performance

Benchmark suites attempt to mimic the workloads of particular user communities

Scientific benchmarks, business benchmarks, consumer benchmarks, etc.

Computer manufacturers report performance results for benchmarks to aid users in making machine comparisons

The hope is that most user workloads can be represented well enough by a modest set of benchmark suites


Example: SPEC CPU2000

Comprises the SPECint2000 and SPECfp2000 benchmarks

SPECint2000 programs
• 164.gzip: Data compression utility
• 175.vpr: FPGA circuit placement and routing
• 176.gcc: C compiler
• 181.mcf: Minimum cost network flow solver
• 186.crafty: Chess program
• 197.parser: Natural language processing
• 252.eon: Ray tracing
• 253.perlbmk: Perl
• 254.gap: Computational group theory
• 255.vortex: Object-oriented database
• 256.bzip2: Data compression utility
• 300.twolf: Place and route simulator

SPECfp2000 programs
• 168.wupwise: Quantum chromodynamics
• 171.swim: Shallow water modeling
• 172.mgrid: Multi-grid solver in 3D potential field
• 173.applu: Parabolic/elliptic partial differential equations
• 177.mesa: 3D graphics library
• 178.galgel: Fluid dynamics: analysis of oscillatory instability
• 179.art: Neural network simulation: adaptive resonance theory
• 183.equake: Finite element simulation: earthquake modeling
• 187.facerec: Computer vision: recognizes faces
• 188.ammp: Computational chemistry
• 189.lucas: Number theory: primality testing
• 191.fma3d: Finite-element crash simulation
• 200.sixtrack: Particle accelerator model
• 301.apsi: Solves problems regarding temperature, wind, distribution of pollutants

SpeedUp

How much faster we go when a new feature is added to our system:

    Sideal = CPU_Time (ref) / CPU_Time (enhanced)
           = (CPIref / Wref x tcref) / (CPIenhanced / Wenhanced x tcenhanced)
           = αref / αenhanced

Amdahl’s law: the performance improvement of an enhancement is limited by the fraction of time the enhancement is used, denoted ϕ:

    S = (N x αref) / ((N x (1-ϕ) x αref) + (N x ϕ x αenhanced))
      = 1 / ((1-ϕ) + ϕ/Sideal)

Amdahl’s law example

Assume an enhancement with Sideal=1000, and that it can be applied 80% of the total time:

    S = 1 / (0.2 + 0.8/1000) ≅ 5

Even in the limit, Sideal = ∞, the speedup is bounded:

    S = 1 / (1 - ϕ)
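Amdahl's law, S = 1 / ((1-ϕ) + ϕ/Sideal), is short enough to evaluate directly. The Python sketch below reproduces the slide's example (ϕ = 0.8, Sideal = 1000) and the limiting case Sideal = ∞; the function name `amdahl_speedup` is illustrative.

```python
# Sketch of Amdahl's law: S = 1 / ((1 - phi) + phi / S_ideal).
import math


def amdahl_speedup(phi, s_ideal):
    """Overall speedup when a fraction `phi` of the time is sped up by `s_ideal`."""
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must be a fraction between 0 and 1")
    denominator = (1.0 - phi) + phi / s_ideal
    if denominator == 0.0:
        return math.inf  # whole program enhanced with an infinite ideal speedup
    return 1.0 / denominator


if __name__ == "__main__":
    print(f"phi=0.8, S_ideal=1000: S = {amdahl_speedup(0.8, 1000):.2f}")
    print(f"phi=0.8, S_ideal=inf:  S = {amdahl_speedup(0.8, math.inf):.2f}")
```

Both cases give a speedup of about 5, showing that the 20% of the time that is not enhanced dominates the result.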

Capacity metrics

How is memory capacity measured?

How much is 1K? 10^3 or 2^10?

How much is 1 KHz?

and 1 Mbit/s?

and 1 Gbyte?

Which is bigger: 1 Gbyte of disk or 1 Gbyte of memory?

☺ Which weighs more, a kilo of iron or a kilo of straw? ☺

Names and symbols for different magnitudes

Name               Symbol  Base 2        Base 10
Kilo               K       2^10 = 1024   10^3 = 1000
Mega               M       2^20          10^6
Giga               G       2^30          10^9
Tera               T       2^40          10^12
Peta               P       2^50          10^15
Exa                E       2^60          10^18
Zetta              Z       2^70          10^21
Yotta              Y       2^80          10^24
Xenta/Xora/Bronto          2^90          10^27


The difference grows with the order of magnitude (2.4% for Kilo, 20.9% for Yotta)
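The growth of this difference can be checked with a short Python sketch that compares 2^(10k) with 10^(3k) for each prefix. The function name `binary_excess_percent` is illustrative.

```python
# Sketch: how much larger the base-2 value of a prefix is than its base-10 value.

PREFIXES = ["Kilo", "Mega", "Giga", "Tera", "Peta", "Exa", "Zetta", "Yotta"]


def binary_excess_percent(order):
    """Percentage by which 2**(10*order) exceeds 10**(3*order); order 1 is Kilo."""
    return (2 ** (10 * order) / 10 ** (3 * order) - 1) * 100


if __name__ == "__main__":
    for order, name in enumerate(PREFIXES, start=1):
        print(f"{name:6s} {binary_excess_percent(order):5.1f}%")
```

Each step multiplies the ratio by 1.024, so the gap compounds from 2.4% at Kilo to roughly 21% at Yotta.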

THE PROBLEM: the same word is used for both powers of 2 and powers of 10. Which one applies depends on the context:

• Hertz (Hz) are measured in powers of 10
  • a processor running at 1 Gigahertz (GHz) runs at 1,000,000,000 Hz
• Transmission speed is measured in powers of 10
  • a 128 Kbit/s mp3 stream transfers 128,000 bits per second
  • a 1 Mbit/s ADSL connection transfers at most 1,000,000 bits per second
• Bus bandwidth is also measured in powers of 10
• RAM memory is measured in powers of 2
  • 1 MB of RAM is 2^20 bytes of RAM

What about storage devices?


Hard Disks (HD) are measured in powers of 10.
• A HD of 30 GB has 30 x 10^9 bytes (approx. 28 x 2^30)

It is not marketing, but tradition: the physical structure of a disk (platters, tracks, sectors) is not required to be a power of 2.

However, the operating system shows disk capacity in powers of 2.
• It seems to be done for consistency between RAM capacity and HD capacity.
• Therefore, a laptop that has, according to the vendor, 1 GB of RAM and 30 GB of HD, will be seen under Windows as 1 GB of RAM and 28 GB of HD

Can things be more confusing? Of course, yes!!!


Some devices use hybrid measurement systems:
• A floppy disk of 1.44 MB is neither 1.44 x 10^6 nor 1.44 x 2^20 bytes, but 1.44 x 1000 x 1024 bytes (1.406 MB in base 2 or 1.475 MB in base 10)

Finally: CDs and DVDs also use different measurements
• CD capacity is given in powers of 2
  • a 700 MB CD (“80 minutes”) has 700 x 2^20 bytes
• DVD capacity is given in powers of 10
  • a DVD of 4.7 GB has 4.7 x 10^9 bytes = 4.38 x 2^30 bytes
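The device capacities quoted in this section can be recomputed with a few lines of Python. This is only a sketch of the unit arithmetic; the constant and variable names are illustrative.

```python
# Sketch: byte counts behind the vendor capacities quoted for disks, floppies, CDs and DVDs.

MIB = 2 ** 20   # base-2 "mega"
GIB = 2 ** 30   # base-2 "giga"

hd_bytes = 30 * 10 ** 9        # 30 GB hard disk, powers of 10
floppy_bytes = 1440 * 1024     # "1.44 MB" floppy = 1.44 x 1000 x 1024 bytes
cd_bytes = 700 * MIB           # 700 MB CD, powers of 2
dvd_bytes = 4700 * 10 ** 6     # 4.7 GB DVD, powers of 10

if __name__ == "__main__":
    print(f"30 GB HD     = {hd_bytes / GIB:.2f} x 2^30 bytes")
    print(f"1.44 MB disk = {floppy_bytes} bytes "
          f"({floppy_bytes / MIB:.3f} MB base 2, {floppy_bytes / 10 ** 6:.3f} MB base 10)")
    print(f"700 MB CD    = {cd_bytes} bytes")
    print(f"4.7 GB DVD   = {dvd_bytes / GIB:.2f} x 2^30 bytes")
```

Working in integer byte counts and converting only for display avoids mixing the two prefix conventions by accident.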
