Topic 1: Introduction
Eduard Ayguadé and Josep Llosa
These slides have been prepared using some material which is part of the teaching material of Prof. Mateo Valero at the Computer Architecture Department and Barcelona Supercomputing Center. Some processor diagrams have been extracted from the “Microprocessor Report” journal, Copyright In/Stat&MDR. Other material available through the internet has also been used to prepare this chapter’s slides.
Why do we need high-performance computing?
Computational Needs of Technical, Scientific, Digital Media and Business Applications
CFD Wing Simulation
  512x64x256 grid (8.3 x 10^6 mesh points)
  5000 FLOPs per mesh point, 5000 time steps/cycles
  2.15 x 10^14 FLOPs
CFD Full Plane Simulation
  512x64x256 grid (3.5 x 10^17 mesh points)
  5000 FLOPs per mesh point, 5000 time steps/cycles
  8.7 x 10^24 FLOPs
  Source: A. Jameson, et al.
Materials Science
  Magnetic materials:
    Current: 2000 atoms; 2.64 TF/s, 512 GB
    Future: HDD simulation; 30 TF/s, 2 TB
  Electronic structures:
    Current: 300 atoms; 0.5 TF/s, 100 GB
    Future: 3000 atoms; 50 TF/s, 2 TB
  Source: D. Bailey, NERSC
Digital Movies and Special Effects
  ~10^14 FLOPs per frame
  50 frames/sec
  90 minute movie: 2.7 x 10^19 FLOPs
  Source: Pixar
Spare Parts Inventory Planning
  Modelling the optimized deployment of 10000 part numbers across 100 part depots requires:
  - 2 x 10^14 FLOP/s (12 hours on 10 650 MHz CPUs)
  - 2.4 PetaFlop/s sustained performance (1 hour turn-around time)
  The industry trend towards rapid, frequent modeling for timely business decision support drives higher sustained performance
  Source: B. Dietrich, IBM
640 nodes, each with 8 NEC SX vector processors (8 Gflop/s peak per processor), 2 ns clock cycle time (tc)
Total of 5120 processors (640 x 8), 40 TFlop/s peak
Vector processor
[Diagram: a control unit fetches instructions (scalar + vector) and data from main memory. Scalar instructions (Ri := Rj op Rk, conditional branches) work on scalar data in the scalar registers; vector instructions (VR[i] := VR[j] op VR[k]) work on vector data in the vector registers.]
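The difference between the two instruction forms in the diagram can be sketched in a few lines of Python. This is only an illustration of the semantics, not a model of any real vector unit: the function names `scalar_op` and `vector_op` are hypothetical, and a Python list stands in for a vector register.

```python
import operator

def scalar_op(op, rj, rk):
    """Ri := Rj op Rk, one instruction producing one result."""
    return op(rj, rk)

def vector_op(op, vr_j, vr_k):
    """VR[i] := VR[j] op VR[k], one instruction applied to every element pair."""
    if len(vr_j) != len(vr_k):
        raise ValueError("vector registers must have the same length")
    return [op(x, y) for x, y in zip(vr_j, vr_k)]

# One scalar add versus one vector add over four elements.
r = scalar_op(operator.add, 1, 10)
vr = vector_op(operator.add, [1, 2, 3, 4], [10, 20, 30, 40])
print(r, vr)
```

A single vector instruction replaces a whole loop of scalar instructions, which is why the instruction count of a vectorized program drops so sharply.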
Evaluating computer system performance
Which metrics can be used to evaluate computer system performance?
Execution time
  Also known as wall clock time, elapsed time, response time
  Total time to complete a task
  Example: hit RETURN, how long until the answer appears on the screen
Throughput
  Also known as bandwidth
  Total number of operations (such as instructions, memory requests, programs) completed per unit time (rate)
Performance improves when execution time is reduced or throughput is increased
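Both metrics can be measured with a wall clock around a repeated task. The sketch below is a minimal illustration, assuming Python's `time.perf_counter` as the clock; the helper `measure` and the task `sample_task` are hypothetical names chosen for this example.

```python
import time

def measure(task, repetitions):
    """Return (elapsed seconds, tasks completed per second) for repeated runs."""
    start = time.perf_counter()
    for _ in range(repetitions):
        task()
    elapsed = time.perf_counter() - start
    throughput = repetitions / elapsed
    return elapsed, throughput

def sample_task():
    # Stand-in workload; any callable could be measured instead.
    return sum(range(10_000))

elapsed, throughput = measure(sample_task, 100)
print(f"execution time: {elapsed:.6f} s, throughput: {throughput:.1f} tasks/s")
```

The printed numbers depend on the machine, so no fixed output is shown here.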
CPU time breakdown
CPU_Time = cycles x tc
  cycles: total cycles to execute the program
  tc: clock cycle time (clock period), i.e. 1/frequency
cycles = INST x CPI / W
  INST: total number of instructions executed
  CPI: average number of clock cycles executed per instruction
    Total clock cycles divided by total instructions executed (CYCLES/INST)
    Different instruction types (add, divide, etc.) may take different numbers of clock cycles to execute
  W: number of instructions executed per cycle
CPU time breakdown
CPU_Time = Σ (INSTi x CPIi) x W^-1 x tc
  INSTi: number of instructions of type i executed
  CPIi: number of clock cycles to execute an instruction of type i
Is it a good metric?
  INST is influenced by the compiler and its optimizations
  The whole system is evaluated (CPU, memory, I/O)
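The per-type formula can be evaluated directly once an instruction mix is known. The following is a small sketch: the instruction mix, the CPI values, and the function name `cpu_time_ns` are all made up for illustration and do not come from the slides.

```python
def cpu_time_ns(mix, w, tc_ns):
    """CPU_Time = sum(INST_i * CPI_i) / W * tc, with tc given in nanoseconds.

    mix maps an instruction type to a pair (INST_i, CPI_i).
    """
    cycles = sum(inst * cpi for inst, cpi in mix.values()) / w
    return cycles * tc_ns

# Hypothetical program: 100 instructions of three types.
mix = {
    "alu": (50, 1),     # 50 instructions, 1 cycle each
    "load": (30, 2),    # 30 instructions, 2 cycles each
    "branch": (20, 3),  # 20 instructions, 3 cycles each
}

# W = 1 and f = 100 MHz, so tc = 10 ns.
print(cpu_time_ns(mix, w=1, tc_ns=10))  # 170 cycles x 10 ns = 1700.0 ns
```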
The INST-CPI tradeoff
In creating a machine code equivalent to a HLL construct, the compiler writer may have several choices that differ in INST and CPI
Best solution may involve more, simpler instructions
Example: multiply by constant 5:
    muli $2, $4, 5      4 cycles
or
    sll  $2, $4, 2      1 cycle
    add  $2, $2, $4     1 cycle
See Exercise 3
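The two instruction sequences compute the same value, since 5x = 4x + x and a left shift by 2 multiplies by 4. A short Python check of that identity, together with the cycle counts quoted on the slide, might look like this (the function names are hypothetical):

```python
def times5_mul(x):
    """One multiply instruction: INST = 1, CPI = 4."""
    return x * 5

def times5_shift_add(x):
    """Shift left by 2 (x * 4), then add x: INST = 2, CPI = 1 each."""
    return (x << 2) + x

cycles_mul = 1 * 4        # one instruction of 4 cycles
cycles_shift_add = 2 * 1  # two instructions of 1 cycle each

print(times5_mul(7), times5_shift_add(7))  # 35 35
print(cycles_mul, cycles_shift_add)        # 4 2
```

The second version executes twice as many instructions yet finishes in half the cycles, which is the point of the tradeoff.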
CPU metrics
tc is not a good metric to compare processors:
  Example: Pentium and Pentium III, both running at 500 MHz
CPI is not a good metric to compare processors:
  Intel Pentium: 5 cycles integer data path
  Intel Pentium Pro (basis for Pentium II and III): 14 cycles integer data path
  Pentium IV: 20 cycles integer data path
What’s behind high CPI values?
Can both be combined in a single metric?
CPU metrics
MIPS: Millions of Instructions Per Second:
  INST x 10^-6 / CPU_Time = (CPI / W x tc)^-1 (with tc in microseconds)
MFLOPS: Millions of FLOating Point operations per Second:
  INST (only FP) x 10^-6 / CPU_Time
Peak metrics:
  MIPSpeak = IPC x W x f (f in MHz)
  MFLOPSpeak = IPC (only FP) x W (only FP) x f
  IPC: maximum number of instructions completed per cycle
Example: the following code is executed on a machine with W=1, f=100 MHz. Compute CPI, MIPS and MFLOPS if CPU_Time = 8 µs.
  real a[64], b[64], c[64]; /* 8 bytes each */
  integer i;
  …
  for (i=1; i<=64; i++)
      c[i] = a[i] + b[i];
  …
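The exercise can be worked through numerically once the number of executed machine instructions is known. The transcript does not give that number, so the sketch below assumes 5 machine instructions per loop iteration purely for illustration; the MFLOPS result does not depend on that assumption, while CPI and MIPS do.

```python
W = 1            # instructions per cycle (from the exercise)
f_mhz = 100      # clock frequency in MHz (from the exercise)
cpu_time_us = 8  # CPU time in microseconds (from the exercise)

iterations = 64
fp_ops = iterations  # one floating-point add per iteration

# ASSUMPTION: 5 machine instructions per iteration (not stated in the slides).
inst_per_iteration = 5
inst = iterations * inst_per_iteration

cycles = cpu_time_us * f_mhz   # microseconds x cycles per microsecond
cpi = cycles * W / inst        # from cycles = INST x CPI / W
mips = inst / cpu_time_us      # instructions per microsecond = MIPS
mflops = fp_ops / cpu_time_us  # FP operations per microsecond = MFLOPS

print(cycles, cpi, mips, mflops)  # 800 2.5 40.0 8.0
```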
Evaluating computer system performance
A workload is a collection of programs
Ideally, users need to evaluate the performance of their workload on a given machine before deciding whether to purchase it
This is impractical because:
  There are too many unique workloads to evaluate
  Programs may be compiled for a different system, and the source code may not be available or easily portable (i.e. easy to get running on a different machine)
So how can we as customers make purchase decisions without being able to run our programs on different machines?
Benchmarks
Benchmarks are programs specifically chosen to measure performance
Benchmark suites attempt to mimic the workloads of particular user communities
Scientific benchmarks, business benchmarks, consumer benchmarks, etc.
Computer manufacturers report performance results for benchmarks to aid users in making machine comparisons
The hope is that most user workloads can be represented well enough by a modest set of benchmark suites
Example: SPEC CPU2000
Composed of the SPECint2000 and SPECfp2000 benchmarks
SPECint2000 programs
  164.gzip: Data compression utility
  175.vpr: FPGA circuit placement and routing
  176.gcc: C compiler
  181.mcf: Minimum cost network flow solver
  186.crafty: Chess program
  197.parser: Natural language processing
  252.eon: Ray tracing
  253.perlbmk: Perl
  254.gap: Computational group theory
  255.vortex: Object-oriented database
  256.bzip2: Data compression utility
  300.twolf: Place and route simulator
SPECfp2000 programs
  168.wupwise: Quantum chromodynamics
  171.swim: Shallow water modeling
  172.mgrid: Multi-grid solver in 3D potential field
  173.applu: Parabolic/elliptic partial differential equations
  177.mesa: 3D graphics library
  178.galgel: Fluid dynamics: analysis of oscillatory instability
  179.art: Neural network simulation: adaptive resonance theory
  183.equake: Finite element simulation: earthquake modeling
  187.facerec: Computer vision: recognizes faces
  188.ammp: Computational chemistry
  189.lucas: Number theory: primality testing
  191.fma3d: Finite-element crash simulation
  200.sixtrack: Particle accelerator model
  301.apsi: Solves problems regarding temperature, wind, distribution of pollutants
SpeedUp
How much faster we go when a new feature is added to our system:
  SpeedUp = execution time without the feature / execution time with the feature
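As a quick numeric illustration, speedup is taken here as the ratio of the old execution time to the new one. That is the usual definition, assumed because the slide's own formula did not survive in the transcript; the timings below are invented.

```python
def speedup(time_old, time_new):
    """Ratio of execution time before and after adding a feature."""
    if time_new <= 0:
        raise ValueError("execution time must be positive")
    return time_old / time_new

# Hypothetical timings: 10 s without the feature, 4 s with it.
print(speedup(10.0, 4.0))  # 2.5
```

A value above 1 means the feature helps, and a value below 1 means it slows the system down.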
The difference grows with the order of magnitude (2.4% for Kilo, about 20.9% for Yotta)
THE PROBLEM: the same prefix is used for both powers of 2 and powers of 10. Which one applies depends on the context:
  Hertz (Hz) are measured in powers of 10
    a processor running at 1 Gigahertz (GHz) runs at 1,000,000,000 Hz
  Transmission speed is measured in powers of 10
    a 128 Kbit/s mp3 stream transfers 128,000 bits per second
    a 1 Mbit/s ADSL connection transfers at most 1,000,000 bits per second
  Bus bandwidth is also measured in powers of 10
  RAM memory is measured in powers of 2
    1 MB of RAM is 2^20 bytes of RAM
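The gap between the binary and decimal reading of each prefix can be computed directly: the n-th prefix means 2^(10n) in base 2 and 10^(3n) in base 10. A minimal sketch (the function name `binary_vs_decimal_gap` is hypothetical):

```python
PREFIXES = ["Kilo", "Mega", "Giga", "Tera", "Peta", "Exa", "Zetta", "Yotta"]

def binary_vs_decimal_gap(n):
    """Percentage by which 2^(10n) exceeds 10^(3n) for the n-th prefix."""
    return (2 ** (10 * n) / 10 ** (3 * n) - 1) * 100

for n, name in enumerate(PREFIXES, start=1):
    print(f"{name:5s} {binary_vs_decimal_gap(n):5.1f}%")
```

The first line of output is 2.4% for Kilo and the last is 20.9% for Yotta, with the gap widening at every step in between.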
What about storage devices?
Capacity metrics
Hard disks (HD) are measured in powers of 10
  A 30 GB HD has 30 x 10^9 bytes (approx. 28 x 2^30)
It is not marketing, but tradition: the physical structure of a disk (platters, tracks, sectors) is not required to be a power of 2.
However, the operating system shows disk capacity in powers of 2.
  It seems to be done for consistency between RAM capacity and HD capacity.
  Therefore, a laptop that has, according to the vendor, 1 GB of RAM and 30 GB of HD will be seen under Windows as 1 GB of RAM and 28 GB of HD
Can things be more confusing? Of course, yes!
Capacity metrics
Some devices use hybrid measurement systems:
  A 1.44 MB floppy disk holds neither 1.44 x 10^6 nor 1.44 x 2^20 bytes, but 1.44 x 1000 x 1024 bytes (1.406 MB in base 2 or 1.475 MB in base 10)
Finally: CDs and DVDs also use different measurements
  CD capacity is given in powers of 2
    a 700 MB CD ("80 minutes") has 700 x 2^20 bytes
  DVD capacity is given in powers of 10
    a 4.7 GB DVD has 4.7 x 10^9 bytes = 4.38 x 2^30 bytes
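The conversions quoted for disks, floppies, CDs and DVDs can be reproduced with a few lines of arithmetic. The sketch below keeps byte counts as integers to avoid rounding surprises; the constant and variable names are chosen for this example only.

```python
MB, GB = 10**6, 10**9    # decimal units
MiB, GiB = 2**20, 2**30  # binary units

hd_bytes = 30 * GB           # 30 GB hard disk, powers of 10
floppy_bytes = 1440 * 1024   # 1.44 x 1000 x 1024, the hybrid unit
cd_bytes = 700 * MiB         # 700 MB CD, powers of 2
dvd_bytes = 47 * 10**8       # 4.7 GB DVD, powers of 10

print(f"HD:     {hd_bytes / GiB:.2f} x 2^30 bytes")
print(f"Floppy: {floppy_bytes} bytes = {floppy_bytes / MiB:.3f} MB (base 2)"
      f" = {floppy_bytes / MB:.3f} MB (base 10)")
print(f"CD:     {cd_bytes} bytes")
print(f"DVD:    {dvd_bytes / GiB:.2f} x 2^30 bytes")
```

The hard disk comes out at 27.94 x 2^30 bytes, which is why the operating system reports roughly 28 GB for a drive sold as 30 GB.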