EC 413 Computer Organization · Prof. Michel A. Kinsy · 2019-12-03

Transcript
Department of Electrical & Computer Engineering
EC 413 Computer Organization
Prof. Michel A. Kinsy
Summary
Computing Devices Then…

Computing Devices Now
The Von Neumann Architecture
§ Stored Program Computer
§ The modern computer system has three major functional hardware units: CPU, Main Memory, and Input/Output devices
[Diagram: Von Neumann organization: the Processor and Memory communicate with I/O Devices (Device#1 … Device#n) and the External World over a Control Bus, an Address Bus, and a Data Bus. A second diagram shows a single-cycle MIPS datapath: PC, instruction memory (Read Address, Instruction[31-0]), register file (read/write addresses and data, RegWrite), ALU with zero and Overflow outputs, data memory (MemRead, MemWrite), sign-extend unit (16 to 32 bits), shift-left-2 and branch adder, and a Control Unit decoding Instr[31-26] into RegDst, ALUSrc, MemtoReg, PCSrc, Branch, and ALUOp, with ALUControl driven by Instr[5-0].]
The Von Neumann Architecture
§ In the most basic sense, a computer is a device consisting of three units performing three distinctive functions
• A processor to interpret and execute programs
• A memory to store both data and programs
• A mechanism for transferring data to and from the outside world
Amdahl's Law Revisited
§ This law answers the critical question:
§ How much of a speedup can one get for a given parallelized task?
§ If s is the fraction of a calculation that is sequential, and (1 − s) is the fraction that can be parallelized, then the maximum speedup that can be achieved by using n processors is
§ Speedup = 1 / (s + (1 − s)/n)
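As a quick check, the formula can be coded directly (a minimal sketch; the function name is ours):

```python
def amdahl_speedup(s, n):
    """Maximum speedup with n processors when a fraction s of the
    work is inherently sequential (Amdahl's law)."""
    return 1.0 / (s + (1.0 - s) / n)

# As n grows, the speedup saturates at the 1/s ceiling:
print(amdahl_speedup(0.2, 8))      # 8 processors, 20% sequential
print(amdahl_speedup(0.2, 10**6))  # approaches 1/0.2 = 5
```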
Amdahl's Law Revisited
§ If 80% of a calculation can be parallelized, i.e. 20% is sequential, then what is the maximum speedup which can be achieved on 8 processors?
§ What if we double the number of processors (n = 16)?
§ What if we double the number of processors again (n = 32)?
§ What if the number of processors is 1000?
Amdahl's Law Revisited
§ If 50% of a calculation can be parallelized, i.e. 50% is sequential, then what is the maximum speedup which can be achieved on 8 processors?
§ What if we double the number of processors (n = 16)?
§ What if we double the number of processors again (n = 32)?
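The exercises above (the s = 0.2 and s = 0.5 cases) can be checked numerically with a short sketch:

```python
# Evaluate Speedup = 1 / (s + (1 - s)/n) for the two exercise scenarios.
for s in (0.2, 0.5):
    for n in (8, 16, 32, 1000):
        speedup = 1.0 / (s + (1.0 - s) / n)
        print(f"s={s}, n={n}: speedup = {speedup:.2f}")
```

With s = 0.2 the speedups are about 3.33, 4.00, 4.44, and 4.98, creeping toward the 1/s = 5 ceiling; with s = 0.5 they stay below 2 (about 1.78, 1.88, 1.94, and 2.00), showing the diminishing returns of adding processors.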
Time/Program = Instructions/Program × Cycles/Instruction × Time/Cycle
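The CPU time equation (time per program = instruction count × CPI × cycle time) can be evaluated with made-up numbers; the instruction count, CPI, and clock rate below are all illustrative:

```python
# CPU time = instructions * (cycles/instruction) * (time/cycle)
instructions = 1_000_000_000   # dynamic instruction count (assumed)
cpi = 1.5                      # average clock cycles per instruction (assumed)
cycle_time = 0.5e-9            # seconds per cycle, i.e. a 2 GHz clock (assumed)

cpu_time = instructions * cpi * cycle_time
print(cpu_time)                # execution time in seconds
```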
Amdahl's Law
§ By Gene Amdahl
§ This law answers the critical question:
§ How much of a speedup can one get for a given architectural improvement/enhancement?
§ The performance enhancement possible due to a given design improvement is limited by the amount that the improved feature is used
§ Performance improvement or speedup due to enhancement E:
Speedup(E) = Execution Time without E / Execution Time with E = Performance with E / Performance without E
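The point that the benefit is limited by how much the improved feature is used can be sketched as follows; the fraction f and the local speedup of the enhancement are illustrative assumptions:

```python
def overall_speedup(f, local_speedup):
    """Overall speedup when an enhancement that is local_speedup times
    faster applies only to a fraction f of the execution time."""
    return 1.0 / ((1.0 - f) + f / local_speedup)

# An enhancement that is 10x faster but covers only 40% of execution time
# yields far less than 10x overall:
print(overall_speedup(0.4, 10))
```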
Processor-Memory Gap
§ Performance gap: CPU (55% each year) vs. DRAM (7% each year)
§ Processor operations take on the order of 1 ns
§ Memory access requires 10s or even 100s of ns
§ Each instruction executed involves at least one memory access

[Figure: relative performance versus calendar year (1980 to 2010) for Processor and Memory, illustrating the widening gap.]
Memory Technology
§ The single-transistor DRAM cell is considerably simpler than the SRAM cell
§ This leads to dense, high-capacity DRAM memory chips

[Figure: (a) DRAM cell: one pass transistor and a capacitor per bit, with word line and bit line; (b) typical SRAM cell: word line, bit line, complementary bit line, and Vcc.]
A Typical Memory Hierarchy

[Figure: memory hierarchy from the register file, instruction cache, and data cache inside the processor (with a bypass network), through the L2 and L3 caches, to main memory and disk; speed increases and capacity decreases toward the processor.]
Memory Organization
§ A memory cannot be both large and fast
§ Increasing sizes of cache at each level
§ A hit at a level occurs if that level of the memory contains the data needed by the CPU
§ A miss occurs if the level does not contain the requested data

CPU → L1 → L2 → DRAM
A Typical Memory Hierarchy

[Figure: CPU with a multi-ported register file (part of the CPU); split L1 instruction and data caches (on-chip SRAM); a large unified L2 cache (on-chip SRAM); and multiple interleaved memory banks (off-chip DRAM).]
Multilevel Caches
§ The cache is transparent to the user (caching happens automatically)
§ Data is in the cache a fraction h of the time; the access goes to main memory 1 − h of the time

[Figure: CPU and register file backed by a cache memory, which is backed by main memory.]
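The hit fraction h turns into an effective (average) access time, as the weighted average of cache and main-memory latency; the latencies below are assumed for illustration:

```python
# Effective access time as a weighted average of cache and memory latency.
h = 0.95            # fraction of accesses served by the cache (assumed)
t_cache = 1.0       # cache access time in ns (assumed)
t_main = 100.0      # main-memory access time in ns (assumed)

t_eff = h * t_cache + (1.0 - h) * t_main
print(t_eff)        # average access time in ns
```

Even a 5% miss fraction pulls the average far above the 1 ns cache latency, which is why hit rates matter so much.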
Caches
§ Local miss rate = misses in cache / accesses to the cache
§ Global miss rate = misses in cache / CPU memory accesses
§ Misses per instruction = misses in cache / number of instructions

CPU → L1 → L2 → DRAM
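A sketch with made-up access counts shows how the local and global definitions differ for an L2 cache:

```python
# Illustrative counts for a two-level hierarchy (all numbers assumed).
cpu_accesses = 1000          # memory references issued by the CPU (= L1 accesses)
l1_misses = 40               # L1 misses: these become accesses to L2
l2_misses = 10               # L2 misses: these go to DRAM

l1_miss_rate = l1_misses / cpu_accesses          # local = global for L1
l2_local_miss_rate = l2_misses / l1_misses       # misses / accesses to L2
l2_global_miss_rate = l2_misses / cpu_accesses   # misses / CPU memory accesses

print(l1_miss_rate, l2_local_miss_rate, l2_global_miss_rate)
```

Note how the L2 local miss rate (25%) looks alarming while its global miss rate (1%) shows that only a tiny fraction of CPU references actually reach DRAM.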
Address Bit-Field Partitioning
§ The address (e.g., 32-bit) issued by the CPU is generally divided into 3 fields
§ Tag
  § Serves as the unique identifier for a group of data
  § Different regions of memory may be mapped to the same cache location/block
  § The tag is used to differentiate between them
§ Index
  § It is used to index into the cache structure
§ Block Offset
  § The least significant bits are used to determine the exact data word
  § If the block size is B, then b = log2(B) bits will be needed in the address to specify the data word

Address = | Tag (t bits) | Index (k bits) | Block Offset (b bits) |
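The three fields can be extracted with shifts and masks; the block size and line count below are assumptions chosen for illustration:

```python
# Split a 32-bit address into tag / index / block offset.
# Assumed geometry: 64-byte blocks (b = 6) and 1024 cache lines (k = 10).
b, k = 6, 10
t = 32 - k - b                  # remaining high-order bits form the tag

def split_address(addr):
    offset = addr & ((1 << b) - 1)          # lowest b bits
    index = (addr >> b) & ((1 << k) - 1)    # next k bits
    tag = addr >> (b + k)                   # remaining t bits
    return tag, index, offset

print(split_address(0x1234ABCD))
```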
Direct-Mapped Cache

[Figure: direct-mapped cache with 2^k lines, each holding a valid bit (V), a tag, and a data block; the address splits into tag (t bits), index (k bits), and block offset (b bits); the stored tag is compared against the address tag to signal a HIT and select the data word or byte.]
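The direct-mapped lookup described above (index into the line, check the valid bit, compare tags) can be sketched in a few lines; the cache geometry here is assumed:

```python
# Minimal direct-mapped cache lookup (valid bit + tag compare).
K, B = 4, 4                      # assumed: 2**4 = 16 lines, 16-byte blocks
lines = [{"valid": False, "tag": None} for _ in range(2 ** K)]

def access(addr):
    """Return True on a hit; on a miss, fill the line and return False."""
    index = (addr >> B) & ((1 << K) - 1)
    tag = addr >> (B + K)
    line = lines[index]
    if line["valid"] and line["tag"] == tag:
        return True                            # hit: valid line, matching tag
    line["valid"], line["tag"] = True, tag     # miss: fetch block into line
    return False

print(access(0x100))   # cold (compulsory) miss
print(access(0x104))   # same block -> hit
print(access(0x1100))  # same index, different tag -> conflict miss
```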
Caching Principles
§ Cache size (in bytes or words)
  § Total cache capacity
  § A larger cache can hold more of the program's useful data, but is more costly and likely to be slower
§ Block or cache-line size
  § Unit of data transfer between cache and main memory
  § With a larger cache line, more data is brought into the cache with each miss; this can improve the hit rate but may also bring low-utility data into the cache
Caching Principles
§ Placement policy
  § Determines where an incoming cache line is stored
  § More flexible policies imply higher hardware cost and may or may not have performance benefits (due to more complex data location)
§ Replacement policy
  § Determines which of several existing cache blocks (into which a new cache line can be mapped) should be overwritten
  § Typical policies: choosing a random or the least recently used block
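Least-recently-used replacement for a single cache set can be sketched with an ordered dictionary; the class name and associativity are ours:

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with LRU replacement (associativity is assumed)."""
    def __init__(self, ways=4):
        self.ways = ways
        self.blocks = OrderedDict()        # tag -> block; order tracks recency

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)   # hit: mark as most recently used
            return True
        if len(self.blocks) >= self.ways:  # set full: evict LRU block
            self.blocks.popitem(last=False)
        self.blocks[tag] = None            # fill with the new block
        return False

s = LRUSet(ways=2)
# The access to tag 3 evicts tag 2 (the least recently used at that point).
print([s.access(t) for t in (1, 2, 1, 3, 2)])
```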
Caching Principles
§ Compulsory misses
  § With on-demand fetching, the first access to any item is a miss
§ Capacity misses
  § We have to evict some items to make room for others
  § This leads to misses that would not occur with an infinitely large cache
§ Conflict misses
  § The placement scheme may force us to displace useful items to bring in other items
  § This may lead to misses in the future
Caching Principles
§ Line width (2^W)
  § Too small a value for W causes a lot of main memory accesses
  § Too large a value increases the miss penalty and may tie up cache space with low-utility items that are replaced before being used
§ Set size or associativity (2^S)
  § Direct mapping (S = 0) is simple and fast
  § Greater associativity leads to more complexity, and thus slower access, but tends to reduce conflict misses
Caching Principles
§ The cache contains copies of some of Main Memory
  § Those storage locations recently used
§ When Main Memory address A is referenced in the CPU, the cache is checked for a copy of the contents of A
§ If found: cache hit
  § The copy is used; no need to access Main Memory
§ If not found: cache miss
  § Main Memory is accessed to get the contents of A
  § A copy of the contents is also loaded into the cache
Cache Performance Metrics
§ Cache miss rate
  § Number of cache misses divided by the number of accesses
§ Cache hit time
  § Time between sending the address and data returning from the cache
§ Cache miss latency
  § Time between sending the address and data returning from the next-level cache/memory
§ Cache miss penalty
  § Extra processor stall caused by the next-level cache/memory access
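These metrics combine into the usual average memory access time (AMAT) formula, AMAT = hit time + miss rate × miss penalty; the numbers below are illustrative:

```python
# AMAT = hit time + miss rate * miss penalty (all numbers assumed).
hit_time = 1.0        # ns to return data on a hit
miss_rate = 0.05      # fraction of accesses that miss
miss_penalty = 20.0   # extra ns of stall on a miss

amat = hit_time + miss_rate * miss_penalty
print(amat)           # average access time in ns
```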
I/O Interface
§ Basic I/O hardware
  § Ports, buses, devices, and controllers
§ I/O Software