[email protected] 2017
The CPU
Computer Architecture 1DT016 distance, Fall 2017
http://xyx.se/1DT016/index.php
Per Foyer, Mail: [email protected]
Computer Architecture 1DT016-HT2017: The CPU (xyx.se/1DT016/sessions/Session3-The-CPU.pdf)

Feb 04, 2020

Where in the machine now?

Level 5: Problem-oriented language level
    int addmul(int t) { return (t + 2) * 2; }
        ↓ Translation (compiler)
Level 4: Assembly language level
    addmul: addi $v0, $a0, 2     # $v0 = t + 2
            mul  $v0, $v0, 2     # $v0 = (t + 2) * 2
            jr   $ra
        ↓ Translation (assembler)
Level 3: Operating system machine level
    li $v0, 4
    syscall
        ↓ Partial interpretation (OS)
Level 2: Conventional machine level
    0x24020004 0x0000000c 0x03E00008
        ↓ Interpretation (microprogram)
Level 1: Microprogramming level
    110110101111010000010110000100010011111010100001
        ↓ Executed by hardware
Level 0: Digital logic level
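The three hex words on this slide are MIPS machine code:
* 0x24020004 is li $v0, 4 (assembled as addiu $v0, $zero, 4).
* 0x0000000c is syscall.
* 0x03E00008 is jr $ra.

The sketch below shows how such a word splits into fields. It assumes the standard MIPS32 R-type/I-type field layout. decode_mips is a hypothetical name, and only the opcodes that appear on this slide are named.

```python
def decode_mips(word: int) -> dict:
    """Split a 32-bit MIPS instruction word into its fields.

    Teaching sketch: only the opcodes/functs seen on the slide are named,
    anything else is reported as "unknown", and J-type (j, jal) is not handled.
    """
    opcode = (word >> 26) & 0x3F      # bits 31..26
    rs = (word >> 21) & 0x1F          # bits 25..21
    rt = (word >> 16) & 0x1F          # bits 20..16
    if opcode == 0:                   # R-type: funct (bits 5..0) selects the operation
        funct = word & 0x3F
        names = {0x08: "jr", 0x0C: "syscall"}
        return {"format": "R", "op": names.get(funct, "unknown"),
                "rs": rs, "rt": rt, "rd": (word >> 11) & 0x1F, "funct": funct}
    names = {0x09: "addiu"}           # I-type: 16-bit immediate in bits 15..0
    return {"format": "I", "op": names.get(opcode, "unknown"),
            "rs": rs, "rt": rt, "imm": word & 0xFFFF}


for w in (0x24020004, 0x0000000C, 0x03E00008):
    print(hex(w), decode_mips(w))
```

Register 2 is $v0 and register 31 is $ra, which matches the assembly above.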

Intel 4004

MCS-4 (chipset):
i4001: ROM (256 bytes)
i4002: RAM (40 bytes)
i4003: Shift register (10 bits)
i4004: CPU (4-bit)

4-bit

Designed by Federico Faggin

Intel 4040

Advanced features at the time

4-bit

Intel 8008

8-bit

In principle, a stretched i4004

Intel 8080

8-bit

Zilog Z80

8-bit

Zilog Z80 (2)

8-bit

CISC – The early days

Complex Instruction Set Computer

•Primary memory was slow and expensive

•Reduce memory accesses: the more that could be done inside the CPU, the better

•µ-code can be (quite) easily changed: enhance or reduce the ISA, fix machine-level bugs

•The more machine instructions available, the easier it is to write high-level language compilers producing ”tight code”

•Less PM (primary memory) needed to store machine instructions

Result: large and writable µ-stores, and in some cases even nano-code

Example: VAX 11/750 had 303 µ-coded assembler instructions

Example: IBM 4341 mainframe

1. IPL1 (hardware): Read CPU instruction set from removable media (5 ¼” floppy)

2. IPL2: Read boot firmware from removable media (5 ¼” floppy)

3. IPL3: Use firmware to boot OS loader from disk drive 0.

4. IPL4: Load OS from disk drive x.

CISC Galore!

• CPU machine instructions can be added or removed
• Bugs in implementations of CPU machine instructions can be corrected

IPL = Initial Program Loader

µ-coded CISC Trivia

• It’s almost impossible to design a CPU without bugs on the hardware level (including µ-code)

• Intel had problems with the infamous FDIV (floating point divide) FPU instruction in the Pentium family.

• The fault was a handful of missing entries in a lookup table used by the FPU’s divider, so it was corrected between CPU steppings (hardware revisions) by fixing that table rather than redesigning the FPU.

• CPU bugs are so common that vendors release several errata updates for the same CPU type during its life span (but no errata after EOL, end of life)

Intel i7 bugs

This errata goes on for 14 pages...

Intel 8088 / 8086

16-bit

CISC: MC68000

16-bit

Asynchronous bus traffic!

Intel 80286

16-bit

Intel 80386 (386DX/SX)

32-bit

MMU

Intel 80486 (486DX/SX)

32-bit

Intel Pentium (”586”)

32-bit

Flynn’s [1] taxonomy

Type | Instructions | Datum [2] | Examples
SISD | 1 | 1 | Classic von Neumann / Harvard
SIMD | 1 | Multiple | Vector processors
MISD | Multiple | 1 | Fault-tolerant systems
MIMD | Multiple | Multiple | Multiprocessors

[1] Michael J. Flynn, Stanford University, 1966

[2] Datum may refer to a part of a data set, e.g. in shared memory

Intel Pentium MMX

MMX – SIMD, 3DNow!, SSE

APIC – Advanced Programmable Interrupt Controller

32-bit

A µ-Coded CPU in no time: What do we need?

•A Register bank (or a few discrete registers)

•PC and SP registers

•One or two ALUs with a set of ALUops

•A flag register: Z, P, N, C, …

•A unidirectional MAR (Memory Address Register)

•A bidirectional MDR (Memory Data Register)

•Outgoing control signals (MREQ, IORQ, RD, WR, …)

•An internal data bus and a control bus

•A µ-coded Control Unit (CU)

…and, of course, an IR (Instruction Register)

µControl Unit: Horizontal µ-code

• Maximum parallelism, given the number of bits from the µROM

• Many control lines make it easier to modify the ISA

• µROM acts as a sequencer of arbitrary control signals

• Easily expanded by adding parallel ROMs

• Large ROMs cost expensive silicon die space

• Sometimes called wide µ-code
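This is a minimal sketch of the horizontal idea. It assumes a hypothetical 12-bit microword in which bit i drives control line i. The signal names come from the fetch steps on the µ-code slide, but the bit order is invented. Every signal has its own bit, so any combination can be asserted in the same step and no decoder is needed. The cost is a wide µROM.

```python
# Hypothetical signal order: one microword bit per control line.
SIGNALS = ["PCout", "MARin", "MARlatch", "PCincr",
           "MREQout", "memRD", "memWAIT",
           "MDRin", "MDRlatch", "MDRout", "IRin", "IRlatch"]


def encode_horizontal(active: set) -> int:
    """Microword with bit i set exactly when SIGNALS[i] is asserted."""
    unknown = active - set(SIGNALS)
    if unknown:
        raise ValueError(f"unknown control signals: {sorted(unknown)}")
    return sum(1 << i for i, name in enumerate(SIGNALS) if name in active)


def decode_horizontal(word: int) -> set:
    """No decoding logic: each bit drives its control line directly."""
    return {name for i, name in enumerate(SIGNALS) if (word >> i) & 1}


# Fetch step 1 asserts four signals in the same cycle.
step1 = encode_horizontal({"PCout", "MARin", "MARlatch", "PCincr"})
print(f"{step1:0{len(SIGNALS)}b}")
```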

µControl Unit: Vertical µ-code

• Signals must be carefully grouped so that partial parallelism can be guaranteed

• Uses less high-speed ROM, which frees up space on the silicon die

• Less flexibility to enhance the ISA

• Sometimes called narrow µ-code
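For contrast, here is a sketch of a vertical microword under invented assumptions:
* Bus drivers that must never be active together share one 2-bit field.
* ALU operations share a second 2-bit field.

ALUopSUB and ALUopAND are hypothetical names, and the field layout is not taken from any real CPU.

```python
# Hypothetical vertical microword: mutually exclusive signals share one
# binary-encoded field, and a decoder stage expands each field.
BUS_SOURCE = ["none", "PCout", "MDRout", "ALUout"]     # 2-bit field, bits 1..0
ALU_OP = ["nop", "ALUopADD", "ALUopSUB", "ALUopAND"]   # 2-bit field, bits 3..2


def encode_vertical(bus: str, alu: str) -> int:
    """Pack the two fields into a 4-bit microword."""
    return BUS_SOURCE.index(bus) | (ALU_OP.index(alu) << 2)


def decode_vertical(word: int) -> set:
    """Decoder stage: turn each field value back into one control line."""
    active = {BUS_SOURCE[word & 0b11], ALU_OP[(word >> 2) & 0b11]}
    return active - {"none", "nop"}


w = encode_vertical("ALUout", "ALUopADD")
print(f"{w:04b}", decode_vertical(w))
```

Four bits here select among six control lines, which would need six bits in horizontal form. The price is that, for example, PCout and MDRout can never be asserted in the same step. That is the partial-parallelism trade-off.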

µControl Unit with Lookup table

• The machine instruction in IR is the index into the Lookup ROM
• The value in the Lookup ROM is the start address (loaded into µPC) of the µ-code corresponding to the machine code in IR
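The lookup step amounts to a table indexed by the opcode field IR[31:26]. In this sketch, LOOKUP_ROM and next_upc are hypothetical names, and the µ-store start addresses are made up. Only the MIPS opcode values are real: lw = 0x23, sw = 0x2B, beq = 0x04.

```python
# Hypothetical lookup ROM: opcode (IR[31:26]) -> start address of its µ-routine.
# The µ-store addresses are invented for illustration.
LOOKUP_ROM = {
    0x23: 0x10,  # lw  (opcode 100011)
    0x2B: 0x18,  # sw  (opcode 101011)
    0x04: 0x20,  # beq (opcode 000100)
}


def next_upc(ir: int) -> int:
    """Value loaded into µPC after decode for the instruction held in IR."""
    opcode = (ir >> 26) & 0x3F
    if opcode not in LOOKUP_ROM:
        raise ValueError(f"no µ-routine for opcode {opcode:#04x}")
    return LOOKUP_ROM[opcode]


# lw $t1, 100($t2) encodes as 0x8D490064 (opcode 0x23, rs=$t2, rt=$t1, imm=100)
print(hex(next_upc(0x8D490064)))
```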

µ-code: Fetch, Decode and Execute

lw $t1, 100($t2)

Fetch
1. PCout, MARin, MARlatch, PCincr
2. MREQout, memRD, memWAIT
3. MDRin, MDRlatch, MDRout, IRin, IRlatch

Decode
1. µJTable[ IR[31:26]out ], µPCin

Execute
1. IR[25:21]out, REGBANKin, REG-RD, REGBANKout, ALUin1, IR[15:0]out, ALUin2, ALUopADD, µPCincr
2. ALUout, MARin, MARlatch, µPCincr
3. MREQout, memRD, memWAIT
4. MDRin, MDRlatch, MDRout, IR[20:16]out, REGBANKin, REG-WR, GotoFETCH
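The lw micro-routine can be mimicked as a sequence of register transfers. This sketch assumes:
* MIPS field positions: base register rs in IR[25:21], destination rt in IR[20:16], offset in IR[15:0].
* Memory held in a dict.
* No wait states.
* Support for the lw opcode only.

run_lw and signext16 are hypothetical helpers, not part of any real tool.

```python
def signext16(value: int) -> int:
    """Sign-extend the 16-bit immediate IR[15:0]."""
    return value - 0x10000 if value & 0x8000 else value


def run_lw(cpu: dict, mem: dict) -> None:
    """Execute one lw by following the µ-steps, one register transfer per line.

    Assumptions: memory is a dict of address -> word, no wait states,
    and only opcode 0x23 (lw) is supported.
    """
    # Fetch
    cpu["MAR"] = cpu["PC"]                 # 1. PCout, MARin, MARlatch
    cpu["PC"] += 4                         #    PCincr
    cpu["MDR"] = mem[cpu["MAR"]]           # 2-3. MREQout, memRD, MDRin, MDRlatch
    cpu["IR"] = cpu["MDR"]                 # 3. MDRout, IRin, IRlatch
    ir = cpu["IR"]
    # Decode: a real CU would load µPC from the lookup table here
    if ((ir >> 26) & 0x3F) != 0x23:
        raise NotImplementedError("this sketch only implements lw")
    # Execute
    base = (ir >> 21) & 0x1F               # rs = IR[25:21]
    dest = (ir >> 16) & 0x1F               # rt = IR[20:16]
    alu = (cpu["REG"][base] + signext16(ir & 0xFFFF)) & 0xFFFFFFFF  # 1. ALUopADD
    cpu["MAR"] = alu                       # 2. ALUout, MARin, MARlatch
    cpu["MDR"] = mem[cpu["MAR"]]           # 3-4. MREQout, memRD, MDRin, MDRlatch
    if dest != 0:                          # 4. REG-WR ($zero is hard-wired to 0)
        cpu["REG"][dest] = cpu["MDR"]


cpu = {"PC": 0, "REG": [0] * 32}
cpu["REG"][10] = 0x1000                    # $t2
mem = {0x0: 0x8D490064, 0x1064: 42}        # lw $t1, 100($t2) stored at address 0
run_lw(cpu, mem)
print(cpu["REG"][9], cpu["PC"])            # $t1 and the incremented PC
```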

The birth of RISC

RISC – The early days
Reduced Instruction Set Computer

Motives:

•CISC ISAs are often overly complex

•Many CISC instructions are very seldom used

•Analyzing an arbitrary program reveals that it is usually written with just a few basic constructs:

• Simple variable or memory assignments
• If … then … else (conditional jumps/branches)
• Loops
• Subroutine / function calls

•How many instructions in the Intel i7 are used less than 0.25% of the time? Is it really worth having them on silicon? Probably not, but there is another side to this: the need to be compatible with every earlier x86 processor ever made…

RISC – The ideas

• Create a small set of very carefully chosen machine instructions, all of a single fixed size

• Only Load and Store instructions refer to memory

• Create optimizing compilers that make full use of these few machine instructions

• Replace µ-code with hardwired control logic

• Register-to-register operations in one clock cycle

• Move complex math instructions to a co-processor

• Fewer instructions means freed space on silicon that can be used for pipelines, larger register files and caches.

RISC – The ideas (2)

Main goals:

•Make the datapath turnaround time as short as possible.

•When no more instructions can be removed, the specification of the RISC ISA is finalized.

MIPS R3000

32-bit

Microprocessor without Interlocked Pipeline Stages

MIPS single cycle datapath

Pipelining Analogy

Pipelined laundry: overlapping execution
Parallelism improves performance

Pipelining
Design the CPU with the overall goal of starting a new instruction every clock cycle

Use a pipeline stage for each step of the instruction cycle:

1. Instruction fetch [IF]
• Get instruction from program memory

2. Instruction decode [ID]
• Translate opcode to control signals and read registers

3. Execute [EX]
• Perform ALU operation, calculate branch targets

4. Memory [MEM] (data)
• Access memory if needed (Load/Store)

5. Write back [WB]
• Update register file

Single-cycle vs. Multicycle vs. Pipelined

MIPS pipelined datapath

Time graphs

(Diagram: pipeline stages per instruction vs. clock cycle)

Latency: 5 cycles
Throughput: 1 instruction / cycle
Concurrency: 5
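The latency and throughput figures follow from a simple count. On an ideal k-stage pipeline, n instructions finish after k + n - 1 cycles. Throughput therefore approaches 1 instruction per cycle as n grows, while each instruction still takes k cycles. The sketch below prints such a time graph. Function names are hypothetical, and no hazards are modelled.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_cycles(n: int, k: int = len(STAGES)) -> int:
    """Cycles for n instructions on an ideal k-stage pipeline (no stalls)."""
    return k + n - 1 if n > 0 else 0


def timing_diagram(n: int) -> list:
    """One row per instruction; instruction i enters IF in cycle i."""
    width = pipeline_cycles(n)
    rows = []
    for i in range(n):
        cells = ["   "] * width
        for s, name in enumerate(STAGES):
            cells[i + s] = f"{name:<3}"
        rows.append(f"i{i}: " + " ".join(cells))
    return rows


for row in timing_diagram(4):
    print(row)
print(pipeline_cycles(4), "cycles;", 4 / pipeline_cycles(4), "instr/cycle")
```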

Pipelining: Hazards

Situations that prevent starting the next instruction in the next cycle (creating pipeline stalls):

•Structure hazards

• A required resource is busy

•Data hazards

• Need to wait for a previous instruction to complete its data read/write, e.g.
add $s0, $t0, $t1
sub $t2, $s0, $t3

•Control hazards

• Deciding on control action depends on previous instruction
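The data hazard in the add/sub example is a read-after-write (RAW) dependency on $s0. The sketch below flags such pairs. It assumes a hypothetical tuple representation of instructions and a classic 5-stage pipeline without forwarding. In that pipeline the register file is written in the first half of a cycle and read in the second, so only readers one or two instructions later are affected.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]       # register written, or None (e.g. sw, beq)
    srcs: Tuple[str, ...]     # registers read


def raw_hazards(prog, window=2):
    """List (producer, consumer, register) for each read-after-write
    dependency whose consumer follows within `window` instructions."""
    hazards = []
    for j, consumer in enumerate(prog):
        for i in range(max(0, j - window), j):
            producer = prog[i]
            if producer.dest is not None and producer.dest in consumer.srcs:
                hazards.append((i, j, producer.dest))
    return hazards


prog = [Instr("add", "$s0", ("$t0", "$t1")),   # add $s0, $t0, $t1
        Instr("sub", "$t2", ("$s0", "$t3"))]   # sub $t2, $s0, $t3
print(raw_hazards(prog))
```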

Pipelining: Data Hazards

Dependencies backward in time cause hazards

Example: Instruction flow – 5-stage pipeline:

lw  $1, 4($2)
sub $4, $1, $5   # $1 is still in pipeline
and $6, $1, $7
or  $8, $1, $9   # $1 available in stage 4
xor $4, $1, $5

”Load-use” data hazard

•May be ”fixed” with a pipeline stall

•…or by inserting NOPs

•…or reordering instructions
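The "inserting NOPs" fix can be sketched as a simple pass over the instruction list. It assumes a 5-stage pipeline with forwarding. Under that assumption, only a load immediately followed by a use of the loaded register needs a one-cycle gap. The tuple format and function name are hypothetical.

```python
NOP = ("nop", None, ())


def insert_load_use_nops(prog):
    """Return a copy of prog with one NOP after every lw whose result is
    read by the very next instruction.

    Instructions are (opcode, destination register or None, source registers).
    """
    out = []
    for i, instr in enumerate(prog):
        out.append(instr)
        nxt = prog[i + 1] if i + 1 < len(prog) else None
        if instr[0] == "lw" and nxt is not None and instr[1] in nxt[2]:
            out.append(NOP)
    return out


prog = [("lw", "$1", ("$2",)),          # lw  $1, 4($2)
        ("sub", "$4", ("$1", "$5")),    # sub $4, $1, $5
        ("and", "$6", ("$1", "$7")),    # and $6, $1, $7
        ("or", "$8", ("$1", "$9")),     # or  $8, $1, $9
        ("xor", "$4", ("$1", "$5"))]    # xor $4, $1, $5
for op, dest, srcs in insert_load_use_nops(prog):
    print(op, dest, srcs)
```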

Pipelining: Structure hazards

• Conflict for use of a resource

• In a MIPS pipeline with a single memory:

• Load/store requires data access
• Instruction fetch would have to stall for that cycle

This would cause a pipeline ”bubble”

• Hence, pipelined datapaths require separate instruction/data memories

• …or separate instruction/data caches
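This is a rough model of the single-memory conflict. It assumes that instruction fetch and the MEM stage of lw/sw compete for one memory port, and that a fetch simply slips to the next free cycle. The function name and parameters are hypothetical, and data and control hazards are ignored.

```python
def single_memory_cycles(ops, stages=5, mem_stage=3):
    """Cycles to run `ops` (a list of opcodes) when instruction fetch and data
    access share one memory port.

    Instruction i is fetched in the first free cycle after instruction i-1;
    lw/sw occupy the port again in their MEM stage, `mem_stage` cycles later.
    """
    busy = set()            # cycles in which the single memory port is taken
    fetch = -1
    for op in ops:
        fetch += 1
        while fetch in busy:        # port used by an earlier load/store: bubble
            fetch += 1
        busy.add(fetch)
        if op in ("lw", "sw"):
            busy.add(fetch + mem_stage)
    return fetch + stages if ops else 0


print(single_memory_cycles(["lw", "add", "add", "add"]))   # one data access
print(single_memory_cycles(["add", "add", "add", "add"]))  # no data access
```

With separate instruction and data memories, the first sequence would take 8 cycles, the same as the second. The extra cycle is the bubble.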

Pipelining: Control Hazards

• Occur when the flow of instruction addresses is not sequential (i.e. not PC = PC + 4) because the instruction flow changes:

• Unconditional branches (j, jal, jr)
• Conditional branches (beq, bne, …)
• Exceptions (internal or external interrupts)

• Possible approaches

• Stall (impacts CPI – Cycles Per Instruction)
• Move the decision point as early in the pipeline as possible, thereby reducing the number of stall cycles
• Delay the decision (requires compiler support)

• Control hazards occur less frequently than data hazards

• Jumps are very infrequent – only 3% of the instructions in a typical program
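The cost of a stall-on-branch strategy is usually expressed as an effective CPI. This back-of-the-envelope sketch assumes an ideal base CPI of 1 and a fixed penalty per control-transfer instruction. The 1-cycle and 3-cycle penalties are illustrative assumptions: they correspond to deciding the branch in ID vs. in MEM in a classic 5-stage MIPS pipeline. The 3% figure is the slide's.

```python
def effective_cpi(base_cpi, branch_fraction, penalty_cycles):
    """Average clock cycles per instruction when every control-transfer
    instruction stalls the pipeline for `penalty_cycles` cycles."""
    return base_cpi + branch_fraction * penalty_cycles


# Hypothetical figures: ideal pipeline (CPI 1), 3% jumps as on the slide,
# 1 stall cycle each (decision in ID) vs. 3 cycles (decision in MEM).
for penalty in (1, 3):
    print(penalty, round(effective_cpi(1.0, 0.03, penalty), 2))
```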

Code reorder (”afterburner”)

C / C++, …
   ↓ gcc -S …
Assembly code
   ↓ Reorganizer
Reordered assembly code
   ↓ gas -o … (assembler)
Object code
   ↓ ld -o …
(binary executable)

Code scheduling to avoid stalls

• Reorder code to avoid use of load result in the next instruction

• Example: a = b + e; c = b + f;

Original code (13 cycles):
lw  $t1, 0($t0)    # b
lw  $t2, 4($t0)    # e
add $t3, $t1, $t2  # b + e   (stall)
sw  $t3, 12($t0)   # a
lw  $t4, 8($t0)    # f
add $t5, $t1, $t4  # b + f   (stall)
sw  $t5, 16($t0)   # c

Reordered code (11 cycles):
lw  $t1, 0($t0)    # b
lw  $t2, 4($t0)    # e
lw  $t4, 8($t0)    # f
add $t3, $t1, $t2  # b + e
sw  $t3, 12($t0)   # a
add $t5, $t1, $t4  # b + f
sw  $t5, 16($t0)   # c
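The 13 vs. 11 cycle counts can be reproduced with a simple model. A 5-stage pipeline with forwarding needs n + 4 cycles for n instructions. It adds one stall whenever a lw is immediately followed by an instruction that reads the loaded register. The function name and tuple format below are hypothetical.

```python
def scheduled_cycles(prog, stages=5):
    """Cycles for prog on a 5-stage pipeline with forwarding: n + stages - 1,
    plus one stall for each lw whose result the next instruction reads.

    Each instruction is (opcode, destination register or None, source registers).
    """
    stalls = sum(1 for prev, cur in zip(prog, prog[1:])
                 if prev[0] == "lw" and prev[1] in cur[2])
    return len(prog) + stages - 1 + stalls


original = [("lw", "$t1", ("$t0",)), ("lw", "$t2", ("$t0",)),
            ("add", "$t3", ("$t1", "$t2")), ("sw", None, ("$t3", "$t0")),
            ("lw", "$t4", ("$t0",)), ("add", "$t5", ("$t1", "$t4")),
            ("sw", None, ("$t5", "$t0"))]
# Hoist the third load above the first add, as in the reordered code.
reordered = [original[i] for i in (0, 1, 4, 2, 3, 5, 6)]
print(scheduled_cycles(original), scheduled_cycles(reordered))
```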

ARM v7 core

32-bit

3-stage pipeline (F, D, E)

ARM Cortex CPU Core

32 / 64-bit

MCU with multiple ARM cores

Intel Core2 microarchitecture

64-bit

AMD Bulldozer Core

64-bit

PIC controller MCU

8-bit

von Neumann or Harvard?

AVR tiny85 MCU

von Neumann or Harvard?

8-bit

2-stage pipeline

PIC32MX795F512L MCU

32-bit

Nvidia Tegra CPU/GPU SoC

!!!

IBM Power8

Heavy usage of caches

64-bit

PowerPC G4 Altivec

32-bit (128-bit AltiVec vectors)

Cray-1 Vector processor

64-bit instr., 512-bit vectors

Co-processors

Used to take load off the main processor

•Floating Point Units (FPU)

•I/O-processors

•Crypto co-processors

•Graphical Processing Units (GPU)

Examples from the 8086 era:

•8087 FPU

•8089 I/O processor

•8288 Bus controller

Cryptographic Co-processor
