CPU Performance Enhancements CS2052 Computer Architecture Computer Science & Engineering University of Moratuwa Dilum Bandara [email protected]
Transcript
Page 1: CPU Performance Enhancements

CPU Performance Enhancements

CS2052 Computer Architecture

Computer Science & Engineering

University of Moratuwa

Dilum Bandara [email protected]

Page 2: CPU Performance Enhancements

Pipelining – It’s Natural!

Laundry example

Amal, Bimal, Chamal, & Dinal each have one load of clothes to wash, dry, & fold

Washer takes 30 minutes

Dryer takes 40 minutes

Folder takes 20 minutes

Page 3: CPU Performance Enhancements

Sequential Laundry

Sequential laundry takes 6 hours for 4 loads

If they learned pipelining, how long would laundry take?

[Figure: timeline from 6 PM to midnight with loads A, B, C, & D in task order; each load takes 30, 40, & 20 minutes, one load after another]

Page 4: CPU Performance Enhancements

Pipelined Laundry – Start Work ASAP

Pipelined laundry takes 3.5 hours for 4 loads

[Figure: timeline from 6 PM to midnight with loads A, B, C, & D in task order and overlapping stages; intervals of 30, 40, 40, 40, 40, & 20 minutes]
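The two laundry totals can be checked with a short simulation. This is a minimal sketch, assuming a load may wait between machines and each machine handles one load at a time; the function and variable names are illustrative and not from the slides.

```python
def pipeline_finish_times(stage_minutes, num_loads):
    """Return finish[i][s]: the minute at which load i leaves stage s."""
    finish = [[0] * len(stage_minutes) for _ in range(num_loads)]
    for i in range(num_loads):
        for s, duration in enumerate(stage_minutes):
            # The machine is free once the previous load has left it.
            stage_free = finish[i - 1][s] if i > 0 else 0
            # The load is ready once it has left the previous machine.
            load_ready = finish[i][s - 1] if s > 0 else 0
            finish[i][s] = max(stage_free, load_ready) + duration
    return finish


STAGES = [30, 40, 20]  # washer, dryer, folder (minutes, from the slide)
LOADS = 4              # Amal, Bimal, Chamal, Dinal

sequential_minutes = LOADS * sum(STAGES)
pipelined_minutes = pipeline_finish_times(STAGES, LOADS)[-1][-1]

print(f"Sequential: {sequential_minutes / 60} hours")
print(f"Pipelined:  {pipelined_minutes / 60} hours")
```

The pipelined total is 30 + 4 × 40 + 20 = 210 minutes: the 40-minute dryer sets the pace once the pipeline is full.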

Page 5: CPU Performance Enhancements

Pipelining Lessons

Pipelining doesn’t reduce latency of a single task

Improves throughput of entire workload

Pipeline rate is limited by slowest pipeline stage

Multiple tasks operate simultaneously

Potential speedup = number of pipeline stages

Unbalanced lengths of pipeline stages reduce speedup

Time to fill pipeline & time to drain/flush it reduce speedup

[Figure: pipelined laundry timeline from 6 PM to 9 PM with loads A, B, C, & D in task order; intervals of 30, 40, 40, 40, 40, & 20 minutes]
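The lessons on this slide can be illustrated numerically. The sketch below assumes identical tasks, so the pipelined time is one full pass through all stages plus one slowest-stage interval for each further task. The helper name and the balanced 30/30/30 case are hypothetical additions for comparison.

```python
def pipeline_speedup(stage_minutes, num_tasks):
    """Speedup of a pipeline over running the same tasks one after another."""
    sequential = num_tasks * sum(stage_minutes)
    # The first task fills the pipeline; each later task finishes one
    # slowest-stage interval after the previous one.
    pipelined = sum(stage_minutes) + (num_tasks - 1) * max(stage_minutes)
    return sequential / pipelined


cases = [("unbalanced 30/40/20", [30, 40, 20]),
         ("balanced 30/30/30", [30, 30, 30])]
for name, stages in cases:
    for n in (4, 100, 10_000):
        print(f"{name}, {n} tasks: speedup = {pipeline_speedup(stages, n):.2f}")
```

With 4 loads the laundry pipeline reaches about 1.71×, and with many loads it approaches 90/40 = 2.25×. Only the balanced pipeline approaches 3×, the number of stages; the gap at small task counts is the fill and drain time.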

Page 6: CPU Performance Enhancements

Instruction Level Parallelism (ILP)

Source: http://mail.humber.ca/~paul.michaud/Pipeline.htm

Page 7: CPU Performance Enhancements

CPU Pipelines

5-stage MIPS pipeline

Source: http://en.wikipedia.org/wiki/Classic_RISC_pipeline
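A 5-stage pipeline can be drawn as a cycle-by-cycle chart. The sketch below is only an illustration: it assumes the classic RISC stage names IF, ID, EX, MEM, & WB (the slide does not name them), one clock cycle per stage, and no stalls or hazards.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # assumed classic RISC stage names


def pipeline_chart(num_instructions, stages=STAGES):
    """Return one text row per instruction, one column per clock cycle."""
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for i in range(num_instructions):
        cells = ["."] * total_cycles
        for s, name in enumerate(stages):
            cells[i + s] = name  # instruction i is in stage s during cycle i + s
        rows.append(f"I{i + 1}: " + " ".join(f"{c:>3}" for c in cells))
    return rows, total_cycles


rows, cycles = pipeline_chart(4)
print("\n".join(rows))
print(f"{cycles} cycles pipelined vs {4 * len(STAGES)} cycles unpipelined")
```

Under these assumptions, n instructions on a k-stage pipeline take n + k − 1 cycles instead of n × k.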

Page 8: CPU Performance Enhancements


Page 9: CPU Performance Enhancements

Pipeline With a Branch Penalty Due to a Taken Branch

Source: http://mail.humber.ca/~paul.michaud/Pipeline.htm

Page 10: CPU Performance Enhancements

Superscalar Architectures

Executes more than 1 instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units

Source: http://mail.humber.ca/~paul.michaud/Pipeline.htm

Page 11: CPU Performance Enhancements

Intel Hyper Threading (HT)

Introduced with Intel Pentium 4

Allows 2 different resources of the CPU to be used at the same time

While the 1st thread’s instructions work with integers (integer unit), the 2nd thread can work on floating-point numbers (floating-point unit)

OS sees 2 logical CPUs

Achieved through a mix of shared, replicated, & partitioned chip resources such as:

Registers

Arithmetic units

Cache memory

Page 12: CPU Performance Enhancements

Amdahl’s Law

What’s the maximum expected improvement to an overall system when only part of it is improved?

Amdahl said this relationship is not linear

Page 13: CPU Performance Enhancements

Amdahl’s Law (Cont.)

Best you could ever hope to do:

$$\text{Speedup}_{\text{maximum}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}}$$

Page 14: CPU Performance Enhancements

Amdahl’s Law – Example

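Floating-point instructions are improved to run 2× faster, but only 10% of actual instructions are FP

$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left(0.9 + \frac{0.1}{2}\right) = 0.95 \times \text{ExTime}_{\text{old}}$$

$$\text{Speedup}_{\text{overall}} = \frac{1}{0.95} = 1.053$$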
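The floating-point example can be reproduced in a few lines. This is a minimal sketch; the function and parameter names are illustrative, and the enhanced fraction is taken as a fraction of the original execution time.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the execution time is improved."""
    new_time = (1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1 / new_time


def amdahl_max_speedup(fraction_enhanced):
    """Upper bound: the enhanced part takes no time at all."""
    return 1 / (1 - fraction_enhanced)


# 10% of the work is floating point and it becomes 2x faster.
print(round(amdahl_speedup(0.10, 2), 3))
# Best case for the same 10% fraction, however fast the FP unit gets.
print(round(amdahl_max_speedup(0.10), 3))
```

Even an infinitely fast floating-point unit would give only about 1.11× here, because the remaining 90% of the time is untouched.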

Page 15: CPU Performance Enhancements

Moore’s Law – Today’s Status

Moore’s Law – Number of transistors on a chip tends to double about every 2 years

Transistor count still rising

Clock speed flattening sharply

Source: www.extremetech.com/wp-content/uploads/2012/02/CPU-Scaling.jpg

Page 16: CPU Performance Enhancements

Dual Core

Introduced by IBM Power4

However, AMD brought it to the consumer market

Combines 2 independent CPUs & their respective caches onto a single silicon chip

Provides better performance improvement than HT

True parallelism

Page 17: CPU Performance Enhancements

Multi-Core

Source: www.anandtech.com/show/5174/why-ivy-bridge-is-still-quad-core

Page 18: CPU Performance Enhancements

Multi-Core (Cont.)

Source: www.legitreviews.com/intel-core-i7-4770k-haswell-3-5ghz-quad-core-cpu-review_2203

Page 19: CPU Performance Enhancements

Multi-Core (Cont.)


Source: www.hardwarecanucks.com/news/cpu/intel-launch-8-core-xeon-nehalemex/

Page 20: CPU Performance Enhancements

Multi-Cores + Hyper Threading


Source: www.notebookcheck.net/Intel-Core-i7-Notebook-Processor-Clarksfield.21025.0.html

Page 21: CPU Performance Enhancements

Many-Cores

GPUs

Graphics Processing Unit

NVIDIA & ATI

SIMD – Single Instruction Multiple Data

Intel Xeon Phi

General purpose

[Figures: NVIDIA Tesla 2070; Intel Xeon Phi]

Page 22: CPU Performance Enhancements

Example Specifications

                                      GTX 480        Tesla 2070      Tesla K80
Peak double precision FP performance  650 Gigaflops  515 Gigaflops   2.91 Teraflops
Peak single precision FP performance  1.3 Teraflops  1.03 Teraflops  8.74 Teraflops
CUDA cores                            480            448             4992
Frequency of CUDA cores               1.40 GHz       1.15 GHz        560/875 MHz
Memory size (GDDR5)                   1536 MB        6 GB            24 GB
Memory bandwidth                      177.4 GB/sec   150 GB/sec      480 GB/sec
ECC memory                            No             Yes             Yes
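The single-precision peaks in the table can be approximated from the core counts and clock rates. The sketch below assumes each CUDA core completes one fused multiply-add (counted as 2 floating-point operations) per clock cycle; that factor is an assumption and is not stated on the slide. For the Tesla K80, the higher 875 MHz clock is used.

```python
FLOPS_PER_CORE_PER_CYCLE = 2  # assumption: one fused multiply-add per cycle


def peak_gflops(cuda_cores, clock_ghz):
    """Rough peak single-precision throughput in Gigaflops."""
    return cuda_cores * clock_ghz * FLOPS_PER_CORE_PER_CYCLE


cards = {
    "GTX 480": (480, 1.40),
    "Tesla 2070": (448, 1.15),
    "Tesla K80": (4992, 0.875),  # 875 MHz, the higher clock in the table
}
for name, (cores, ghz) in cards.items():
    print(f"{name}: about {peak_gflops(cores, ghz) / 1000:.2f} Teraflops")
```

Under this assumption the estimates land close to the listed single-precision figures, which suggests the peaks are simple products of core count and clock rate rather than measured results.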

Page 23: CPU Performance Enhancements

CPU vs. GPU Architecture

GPU devotes more transistors to computation

Page 24: CPU Performance Enhancements

Multithreaded SIMD Processor

Source: Computer Architecture by John L. Hennessy and David A. Patterson

Page 25: CPU Performance Enhancements

NVIDIA CUDA Architecture


Page 26: CPU Performance Enhancements

Intel Xeon Phi

Source: www.pcgameshardware.de/Xeon-Phi-Hardware-256199/News/Intel-Xeon-Phi-Hardware-Informationen-1040924/

Page 27: CPU Performance Enhancements

Intel Xeon Phi (Cont.)

Source: www.altera.com/technology/system-design/articles/2012/multicore-many-core.html

Page 28: CPU Performance Enhancements

Power Consumption

Dynamic energy

Consumed when a transistor switches from 0 → 1 or 1 → 0

½ × Capacitive load × Voltage²

Dynamic power

½ × Capacitive load × Voltage² × Frequency switched

Static power consumption

Current_static × Voltage

Scales with number of transistors

Reducing voltage reduces energy

Reducing clock rate reduces power, not energy

Power gating (cutting the supply to idle blocks) rather than only taking out the clock signal
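The power formulas can be tried out with a few numbers. This is a minimal sketch: the capacitance, voltage, and frequency values are hypothetical and chosen only to show how the quantities scale.

```python
def dynamic_energy(capacitive_load, voltage):
    """Energy in joules for one 0 -> 1 or 1 -> 0 transition."""
    return 0.5 * capacitive_load * voltage ** 2


def dynamic_power(capacitive_load, voltage, frequency):
    """Power in watts when switching `frequency` times per second."""
    return dynamic_energy(capacitive_load, voltage) * frequency


def static_power(static_current, voltage):
    """Leakage power in watts, drawn even when nothing switches."""
    return static_current * voltage


C, V, F = 1e-9, 1.0, 3e9  # hypothetical: 1 nF, 1.0 V, 3 GHz

# Lowering the voltage by 15% cuts energy per transition quadratically.
print(dynamic_energy(C, 0.85 * V) / dynamic_energy(C, V))
# Halving the clock halves power; energy per transition is unchanged.
print(dynamic_power(C, V, F / 2) / dynamic_power(C, V, F))
```

The first ratio is 0.85² ≈ 0.72 and the second is 0.5. A task that needs a fixed number of transitions therefore uses the same dynamic energy at the lower clock and simply takes longer.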