Computer architecture short note (version 8)
Nimmi Rashinika, Apr 14, 2017

Computer Architecture IN 2320

Lesson 02 – Introduction

Computer architecture:

Deals with the functional behavior of a computer system as viewed by a programmer

Ex: the size of a data type – e.g., 32 bits for an integer

Computer organization:

Deals with structural relationships that are not visible to the programmer

Ex: clock frequency or the size of the physical memory

Levels of a computer:

1. User Level: Application Programs (HIGH LEVEL)

2. High level languages

3. Assembly Language/ Machine Code

4. Microprogrammed/ Hardwired Control

5. Functional Units (Memory, ALU, etc)

6. Logic Gates

7. Transistors and Wires (LOW LEVEL)

Computer Architecture-Definition:

The attributes of the computer system that are visible to programmers i.e. the attributes of the

computer system that have a direct impact on the logical execution of a program

Ex: the instruction set, the size of a data type, techniques of addressing the memory

EX: Architectural issue is whether a computer will have a multiply instruction

Computer Organization-Definition:

The operational units and their interconnection that realize the architectural specifications

Ex: control signals, interface between computer and peripherals, memory technology used

Ex: An organizational issue is whether the multiply instruction is implemented using a separate multiplier circuit or by repeated use of the adder circuit. The organizational decision may be based on several parameters, such as the anticipated frequency of use of the multiply instruction.


Forces on Computer Architecture:

Technology

Programming Languages

Applications

OS

History

The Computer Architect’s view:

Architect is concerned with design & performance

Designs the ISA for optimum programming utility and optimum performance of implementation

Designs the hardware for best implementation of the instructions

Uses performance measurement tools, such as benchmark programs, to see that goals are met

Balances performance of building blocks such as CPU, memory, I/O devices, and

interconnections

Meets performance goals at lowest cost

Factors involved when selecting a better computer are:

1. COST factors

a. Cost of hardware design

b. Cost of software design (OS, applications)

c. Cost of manufacture

d. Cost to the end purchaser

2. PERFORMANCE factors

a. What programs will be run?

b. How frequently will they be run?

c. How big are the programs?

d. How many users?

e. How sophisticated are the users (User level)?

f. What I/O devices are necessary?

g. There are two ways to make computers go faster.

i. Wait some time (e.g., a year) and implement in a faster/better/newer technology.

1. More transistors will fit on a single chip.

2. More pins can be placed around the IC.

3. The process used will have electronic devices (transistors) that switch

faster.

ii. New/innovative architectures and architectural features, and clever

implementations of existing architectures.


Higher Computer performance may involve one or more of the following:

Short response time for a given piece of work

o The total time taken by a functional unit to respond to a request for service

o Functional unit/ execution unit is a part of CPU that performs the operations and

calculations as instructed by a computer.

High throughput (rate of processing work)

o Rate at which something can be processed

Low utilization of computing resources

o System resources(practical): physical or virtual entities of limited availability

Ex: memory, processing capacity, network speed

o Computational resources(abstract): resources used for solving a computational problem

Ex: computational time, memory space

Fast data compression and decompression

High bandwidth

Short data transmission time

*note red coloured performance factors are the area of interest.

Throughput:

if(no overlap or if no parallelism)

throughput = 1/average response time

else

throughput > 1/average response time

//the number of parallel processing units is also important

Elapsed time/response time:

Elapsed time = Response time = CPU time + I/O wait time

CPU time = time spent running a program

Performance= 1/response time

Since we are more concerned about CPU time,

Performance = 1/CPU time

*note Improve Performance

1. Make the CPU faster

Helps to improve both response time and throughput

2. Add more CPUs

Helps to improve throughput and perhaps response time due to less queuing


*Note: Selection depends on what is important to whom, i.e. cost factors and performance factors

Ex 01: Computer system user

Goal: Minimize elapsed time for program=time_end-time_start

Called response time (counted in ms)

Ex 02: Computer Center Manager

Goal: Maximize completion rate = no. of jobs per second

Called throughput (counted per sec)

Factors driving architecture:

Effective use of new technology

Achieving a desired performance improvement

Performance Metrics

Values derived from some fundamental measurements:

Count of how many times an event occurs

Duration of a time interval

Size of some parameter

Some basic metrics include:

Response time

o Elapse time from request to response

o Elapsed time = Response time = CPU time + I/O wait time

CPU time = time spent running a program

Performance = 1/response time

Since we are more concerned about CPU time,

Performance = 1/CPU time

o CPU time is affected by;

Number of instructions in the program

Average number of clock cycles to complete one instruction

Clock cycle time

Throughput

o Jobs or operations completed per unit of time

Bandwidth

o Bits per second

Resource utilization


Standard benchmark metrics

SPEC

TPC

Characteristics of good metrics:

Linear

o Proportional to the actual system performance

Reliable

o Larger value -> better performance

Repeatable

o Deterministic when measured

Consistent

o Units and definition constant across systems

Independent

o Independent from influence of vendors

Easy to measure

Some examples of Standard Metrics:

MIPS

MFLOPS, GFLOPS, TFLOPS, PFLOPS

SPEC metrics

TPC metrics

Parameters of Performance Metrics:

Clock rate (=1/Clock cycle time)

Instructions per program (I/P)

Average clock cycles per instruction (CPI)

Service time

Interarrival time (time between arrivals of successive requests)

Number of users

Think time

*note Execution time (CPU time, runtime) = I/P * CPI * clock cycle time <= Iron Law

All the three factors are combined to affect the metric Execution time.

I/P -> depend on compiler

CPI -> depend on CPU design/organization

Clock cycle time -> processor architecture
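As a quick sanity check of the Iron Law above, here is a minimal Python sketch (illustrative, not from the notes; the variable names are assumptions):

    # Iron Law: execution time = (instructions/program) * (cycles/instruction) * (time/cycle)
    def execution_time(instructions_per_program, cpi, clock_cycle_time_s):
        return instructions_per_program * cpi * clock_cycle_time_s

    # Example: 1e9 instructions, CPI = 2.0, 1 ns clock cycle (1 GHz clock)
    print(execution_time(1e9, 2.0, 1e-9))   # -> 2.0 seconds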


Ex01:

Our program takes 10s to run on computer A, which has 400 MHz clock. We want it to run in 6s. The

designer says that the clock rate can be increased, but it will cause the total number of clock cycles for

the program to increase to 1.2 times the previous value. What is the minimum clock rate required to get

the desired speedup?

Answer:

            Old Machine A    New Machine A
Runtime     10 s             6 s
Clock Rate  400 MHz          CR

Let the total number of clock cycles per program on the old machine A = x

Since clock cycles per program = Clock Rate * Runtime,
x = 400 MHz * 10 s = 4000 M cycles

Total number of clock cycles per program on the new machine A = 1.2x = 1.2 * 4000 M = 4800 M cycles

Also, 1.2x = CR * 6 s
CR = 4800 M cycles / 6 s = 800 MHz

Workload:

A test case for the system

Benchmark:

A set of workloads which together are representative of "my program"; a benchmark should be reproducible.

Ex02:

Which is faster? A or B?

Test Case    Machine A    Machine B
1            1 s          10 s
2            100 s        10 s

Assume Test Case 1 type processes happen 99% of the time.


Answer:

We have to obtain the weighted average of runtime.

Weighted average for A = (1 x 99 + 100 x 1) / 100 = 1.99 s  <= answer: A is faster

Weighted average for B = (10 x 99 + 10 x 1) / 100 = 10 s
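A minimal Python sketch (illustrative, not from the notes) of the weighted-average comparison above:

    def weighted_avg_runtime(times, weights):
        # times and weights are parallel lists; weights need not sum to 1
        return sum(t * w for t, w in zip(times, weights)) / sum(weights)

    print(weighted_avg_runtime([1, 100], [99, 1]))   # Machine A -> 1.99 s
    print(weighted_avg_runtime([10, 10], [99, 1]))   # Machine B -> 10.0 s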

*note

The cost of improving the whole processor is high. But if you find that a particular circuit is needed 99% of the time (e.g., the multiply circuit), then you can improve just that circuit by a factor of 2 or 3; you improve the performance of the system as a whole that way.

Performance comparison

Performance = 1 / time

There are 2 machines A and B.

Performance(A) = 1 / time(A)
Performance(B) = 1 / time(B)

Therefore;

Performance(A) / Performance(B) = time(B) / time(A) = 1 + x/100   iff A is x% faster than B

Ex03:

time(A) = 10 s, time(B) = 15 s

Performance(A) / Performance(B) = time(B) / time(A) = 15/10 = 1.5 = 1 + 50/100

i.e. A is 50% faster than B
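A tiny sketch (illustrative helper name, not from the notes) of the "x% faster" calculation:

    def percent_faster(time_a, time_b):
        # How much faster A is than B, as the x in 1 + x/100
        return (time_b / time_a - 1) * 100

    print(percent_faster(10, 15))   # -> 50.0, i.e. A is 50% faster than B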

Breaking down performance:

A program is broken into instructions.

o Hardware is aware of instructions, not programs.

At a lower level, hardware breaks instructions into cycles.

o Low-level state machines change state every cycle

For example 500MHz P-III runs 500M cycles/sec, 1 cycle = 2ns


Iron Law

Processor time = Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
                              = (Code size)            x (CPI)                x (Cycle time)

Instructions/Program (code size)  -> Architecture; the compiler designer's concern
    Instructions executed, not static code size
    Determined by algorithm, compiler, ISA

Cycles/Instruction (CPI)  -> Implementation; the processor designer's concern
    Average number of clock cycles to complete one instruction
    Determined by ISA and CPU organization
    Overlap among instructions reduces this term

Time/Cycle (cycle time)  -> Realization; the chip designer's concern
    Determined by technology, organization, clever circuit design

Ex04:

Machine A: clock 1 ns, CPI 2.0, for program X
Machine B: clock 2 ns, CPI 1.2, for program X

Which is faster, and by how much?

Time(A) = I/P * CPI * Clock cycle time = I/P * 2.0 * 1 ns = 2.0 I/P ns
Time(B) = I/P * CPI * Clock cycle time = I/P * 1.2 * 2 ns = 2.4 I/P ns

Performance(A) / Performance(B) = Time(B) / Time(A) = 2.4 I/P / 2.0 I/P = 1.2 = 1 + 20/100

i.e. machine A is 20% faster than machine B

Ex05:

Keep clock(A) at 1ns and clock(B) at 2ns.

For equal performance, if CPI(B) = 1.2, what is CPI(A)?

Time(A) = I/P * CPI * Clock cycle time = I/P * CPI(A) * 1 ns = CPI(A) * I/P ns
Time(B) = I/P * CPI * Clock cycle time = I/P * 1.2 * 2 ns = 2.4 I/P ns

Performance(A) / Performance(B) = Time(B) / Time(A) = 2.4 I/P / (CPI(A) * I/P) = 2.4 / CPI(A)

For equal performance, 2.4 / CPI(A) = 1

CPI(A) = 2.4


Other Metrics

MIPS: Million Instructions Per Second
MFLOPS: Million FLOating point operations Per Second
GFLOPS: Giga FLOating point operations Per Second

Since floating point numbers contain three parts (sign, mantissa, and exponent), a floating point operation takes more time than an integer operation, i.e. floating point operations take more cycles per instruction. Therefore we take the worst case as the metric.

The common case differs from application to application. The difference can be significant if a program relies predominantly on integers, as opposed to floating point operations.

Ex06:

Without floating point (FP) hardware, an FP operation may take 50 single-cycle instructions. With FP hardware, it takes only one 2-cycle instruction.

Thus adding FP hardware:

CPI increases.
Instructions/program decreases.
Total execution time decreases.

        without FP hardware    with FP hardware
I/P     50                     1
CPI     1                      2

Instruction Set Architecture (ISA) changes => CPI changes

Compiler design also changes => I/P changes

Since no change to clock rate, clock cycle time remains the same.

CPU Time = I/P * CPI * Clock cycle time

CPU Time without FP hardware = 50 * 1 * Clock cycle time

CPU Time with FP hardware = 1 * 2 * Clock cycle time

CPU Time with FP hardware < CPU Time without FP hardware


Average

If programs run equally:

Arithmetic mean = (1/n) * sum_{t=1..n} time(t)

If the programs run in different proportions:

Weighted arithmetic mean = [ sum_{t=1..n} weight(t) x time(t) ] / [ sum_{t=1..n} weight(t) ]

Ex07:

            Machine A CPU time    Machine B CPU time
Program 1   1 ns                  10 ns
Program 2   1000 ns               100 ns

Which is the fastest computer?

If programs run equally:

Mean CPU time of A = (1 + 1000) / 2 = 1001 / 2 = 500.5 ns
Mean CPU time of B = (10 + 100) / 2 = 110 / 2 = 55 ns

Machine B is the fastest.

If program type 1 runs 90% of the time and program type 2 runs 10% of the time:

Mean CPU time of A = (1 x 90 + 1000 x 10) / 100 = 10090 / 100 = 100.9 ns
Mean CPU time of B = (10 x 90 + 100 x 10) / 100 = 1900 / 100 = 19 ns

Machine B is the fastest.


Amdahl's Law

Improving the most heavily used component by a large factor is better than improving everything by a small factor, i.e. speed up the common case!

Speed-up of a computer:

The definition of the overall/final speed-up is given below.

Speed up = old time taken / new time taken

If you have improved the performance, some parts will complete in less time and speed-up > 1; otherwise you have not improved anything.

According to Amdahl's law, we do not try to improve the whole processor at once; instead we select a particular part and improve it.

Ex08:

75% of a program that took 40 ns was improved. Therefore 75% of the program runs at the new speed and 25% of the program runs at the old speed.

Before the improvement, those instructions (the 75%) executed in 5 ns; after the improvement they execute in 1 ns. The old time taken to execute the whole program was 40 ns.

Assuming the improvement is applied only to a fraction f of the program, the speed-up of that fraction is s = 5 ns / 1 ns = 5.

new time taken = (1 - f) x old time taken + f x old time taken / s

new time taken = (1 - 0.75) x 40 + 0.75 x 40 / 5 = 16 ns

Overall speed up = old time taken / [ (1 - f) x old time taken + f x (old time taken / s) ]
                 = 1 / [ (1 - f) + f/s ]

Amdahl's Law:

Overall speed up = 1 / [ (1 - f) + f/s ]

"Speed up the common case."

Amdahl's Law Limit:

Maximum overall speed up = lim (s -> infinity) 1 / [ (1 - f) + f/s ] = 1 / (1 - f)
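A minimal Python sketch (illustrative, not from the notes) of the overall speed-up formula, checked against Ex08 (f = 0.75, s = 5 gives 2.5):

    def amdahl_speedup(f, s):
        # f = fraction of the program that is improved, s = speed-up of that fraction
        return 1.0 / ((1.0 - f) + f / s)

    print(amdahl_speedup(0.75, 5))      # -> 2.5 (the 40 ns program shrinks to 16 ns)
    print(amdahl_speedup(0.8, 1e12))    # -> ~5, approaching the limit 1/(1 - f) for f = 0.8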


Figure 1: Amdahl's Law limit

If (1 - f) is nontrivial (i.e. reducing it further is extremely difficult and time consuming), the speed-up is limited.

If a program is highly sequential, there is no solution other than increasing the speed-up of a fraction of the program.

If it is parallel, we have the additional option of increasing the parallelism.

*note

The performance enhancement possible with a given improvement is limited by the amount the

improved feature is used.

Ex: To make a significant impact on the CPI, identify the instructions that occur more frequently and

optimize the design for them.

Ex09:

Program runs In 100s multiplies 80% of the time. Designer M can improve the speed-up of multiply

operations. Now I am a user and I need to make My program 5 times faster. How much speed-up should

M achieve to allow me to reach my overall speed-up goal?

First we need to check whether we can achieve this speed up practically. So let us find the maximum

speed up that we can achieve by f of 80%.

Maximum speed up that we can achieve by f of 80% = 1

1−0.8 =

1

0.2 = 5

We can achieve an overall speed up of 5 if we give an infinite speed up for multiplication instruction. i.e.

s → ∞

The designer M was asked to improve the overall speed up to 5. Theoretically we proved that maximum

overall speed up is also 5. Normally practical maximum speed up is always less than the theoretical

maximum speed up. Therefore this goal cannot be achieved by designer M.


Ex10:

Usage frequency and the cycles per operation are given below.

Operation    Frequency    Cycles per operation
ALU          43%          1
Load         21%          1
Store        12%          2
Branch       24%          2

Assume stores can execute in 1 cycle by slowing clock by 15%. Is it worth implementing this?

Execution time (CPU time, runtime) = I/P * CPI * clock cycle time

CPI = average number of clock cycles per instruction

Old CPI = (43 x 1 + 21 x 1 + 12 x 2 + 24 x 2) / 100 = 1.36
New CPI = (43 x 1 + 21 x 1 + 12 x 1 + 24 x 2) / 100 = 1.24

Let old clock cycle time = x

Since the clock will be slowed down by 15%, the clock cycle time will increase by 15% because of the inverse relationship (clock rate = 1 / clock cycle time).

Therefore new clock cycle time = 1.15x

Since the compiler and the instruction set remain the same, I/P is constant.

                    Old machine    New machine
I/P                 I/P            I/P
CPI                 1.36           1.24
Clock cycle time    x              1.15x

Old CPU time = I/P x 1.36 x x
New CPU time = I/P x 1.24 x 1.15x

Speed up = (I/P x 1.36 x x) / (I/P x 1.24 x 1.15x) = 1.36 / (1.24 x 1.15) = 0.95

Since the speed up < 1,

Old CPU time / New CPU time < 1
Old CPU time < New CPU time

This implementation is not worth doing.
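A short Python sketch (illustrative, mirroring the numbers in Ex10) of the weighted-CPI and speed-up calculation:

    def weighted_cpi(freqs_percent, cycles):
        # freqs_percent and cycles are parallel lists, one entry per operation class
        return sum(f * c for f, c in zip(freqs_percent, cycles)) / 100.0

    old_cpi = weighted_cpi([43, 21, 12, 24], [1, 1, 2, 2])   # -> 1.36
    new_cpi = weighted_cpi([43, 21, 12, 24], [1, 1, 1, 2])   # -> 1.24

    # Speed-up = old time / new time; I/P cancels, and the clock cycle time grows by 15%
    speedup = old_cpi / (new_cpi * 1.15)
    print(old_cpi, new_cpi, round(speedup, 2))               # -> 1.36 1.24 0.95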


Generations of Computer

Vacuum tube

Transistor

Small scale IC

Medium scale IC

Large scale IC

Very large scale IC

Ultra large scale IC

AI

Moore's Law

The observation that, over the history of computing hardware, the number of transistors in a dense IC has doubled approximately every two years.

Figure 2: CPU Transistor Counts 1971 - 2008 and Moore's Law

Consequences:

Higher packing density means shorter electrical paths, giving higher performance in speed

Smaller size gives increased flexibility

Reduced power and cooling requirements

Fewer interconnections increases reliability

Cost of a chip has remained almost unchanged.


Requirements changed over time:

Image processing

Speech recognition

Video conferencing

Multimedia authoring

Voice and video annotation files

Simulation modeling

Ways to speeding up the processor:

Pipelining

On board cache

On board L1 and L2 cache

Branch prediction

Data flow analysis

Speculative execution

Performance mismatch:

Processor speed increases

Memory capacity increases

But memory speed always lags behind processor speed

Figure 3: DRAM (Main Memory) and Processor characteristics


Solutions:

Increase number of bits retrieved at one time

o Make DRAM wider rather than deeper

Change DRAM interface

o Cache

Reduce frequency of memory access

o More complex cache and cache on chip

Increase interconnection bandwidth

o High speed buses

o Hierarchy of buses

Final Computer Performance is measured in CPU time:

CPU time = Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)

                   Instruction Count    CPI      Clock Rate
Program            x
Compiler           x                    (x)
Instruction set    x                    x
Organization                            x        x
Technology                                       x


Lesson 03 – Computer Memory

A memory unit is a collection of storage cells together with the necessary circuits for transferring information in and out of storage.

Memory stores binary information in groups of bits called words.

A word in a memory is a fundamental unit of information in a memory.

Holds a series of 1's and 0's

Represents numbers, instruction codes, characters etc.

A group of 8 bits is called a byte, which is the fundamental unit of measure.

Usually a word is a multiple of bytes.

Classification of memory according to key characteristics:

Location

o CPU (registers)

o Internal (Main Memory/ RAM)

o External (backing storage)

Capacity

o Word size (The natural unit of organization)

o Number of words (Or bytes)

Unit of transfer

o Internal (depend on the bus width)

o External (memory block)

o Addressable unit (smallest unit which can be uniquely addressed, word internally)

Access method

o Sequential (ex: tape)

o Direct (ex: disk)

o Random (ex: RAM)

o Associative (ex: cache, within words)

Performance

o Access time (latency)

o Memory cycle time

o Transfer rate

Physical type

o Semiconductor (ex: SRAM, caches)

o Magnetic (ex: disk and tape)

o Optical (CD and DVD)

o Others (ex: bubble)


Physical characteristics

o Decay (leak charges in capacitors in DRAM)

o Volatility

o Erasable

o Power consumption

Organization

Figure 4: Memory Hierarchy

Classification of memory according to key characteristics:

Location

Whether memory is internal or external to the computer

Internal memory:

Often refers to the Main Memory

But there are other types of internal memory too, which are associated with the processor

o Register memory

o Cache memory

External memory:

Refers to peripheral storage devices, such as disk and tape

Accessible to the processor via I/O controllers


Capacity

Internal memory:

Measured in terms of bytes or words

Order of 1, 2, 4, 8 bytes

External memory:

Measured in terms of hundreds of Mega bytes or Giga bytes

Unit of transfer

Internal memory:

Refers to the number of data lines into and out of the memory module

This may equal the word length, but is often larger: 128 or 256 bits

Concepts related to internal memory:

Word

o Natural unit of organization of memory

o The size of the word is typically equal to the number of bits used to represent a number

and to the instruction length. But there are exceptional cases too.

Addressable units

o Refers to the location which can be uniquely addressed

o In some systems addressable unit is the word.

o Many systems allow addressing at byte level.

o In any case relationship between the length in bits A of an address and the number N of

addressable units is 2A = N, range of addressable units 0 to (2A – 1)

External memory:

Data are often transferred in much larger units than a word, and these are referred to as blocks.

Access method

Methods of accessing units of data

Sequential access:

Memory is organized into units of data called records.

Access must be made in specific linear sequence.

Each intermediate record from current location to the desired location should be passed and

rejected.

Time to access an arbitrary record is highly variable, depending on the location of the data and the previous location of the read/write head.

Ex: tape


Direct access:

Individual blocks or records have a unique address based on physical location.

Access is accomplished by direct access to reach a vicinity plus sequential searching to reach the

final location.

Access time is variable.

Ex: Disk units

Random access:

Each addressable location in the memory is unique, physically wired in addressing mechanism.

The time to access a given location is independent of the sequence of prior accesses and is

constant.

Ex: Main memory, some cache systems

Associative access:

This is a random access type of memory that enables one to make a comparison of desired bit

locations within a word for a specific match, and to do this for all words simultaneously.

Thus a word is retrieved based on a portion of its contents rather than its address.

This is a very high speed searching kind of a memory access.

Ex: cache

Performance

Capacity and performance are the most important characteristics for a user

Access time (latency):

For Random Access memory

Time taken to perform a read or write operation, i.e. time from the instant that an

address is presented to the memory to the instant that data have been stored or made

available for use

For non Random Access memory

Time taken to position the read-write mechanism at the desired location

Memory Cycle time:

Primarily applied for random access memory

Memory cycle time = Access time + Time required before a second access can commence

The time required before a second access can commence is the time taken to recover.

Memory cycle time is concerned with the system bus, not the processor.


Transfer rate:

The rate at which data can be transferred into and out of a memory unit

For random access memory

Transfer rate = 1 / Cycle time

For non random access memory:

T_N = T_A + N / R

T_N = average time to read or write N bits
T_A = average access time
N = number of bits
R = transfer rate in bits per second (bps)

Memory Access time

Average access time (Ts) = H * T1 + (1 - H) * (T1 + T2)

Access efficiency = T1 / Ts

H  = fraction of all memory accesses that are found in the faster memory (hit ratio)
T1 = access time to level 1
T2 = access time to level 2

(Two-level hierarchy: CPU -> L1 (Cache) -> L2 (Main memory))


Ex01:

Suppose a processor has two levels of memory. Level 1 contains 1000 words and has an access time of 0.01 μs. Level 2 contains 100,000 words and has an access time of 0.1 μs. The processor can access Level 1 directly; if the word is in Level 2, it is first transferred into Level 1 and then accessed by the processor.

For simplicity we ignore the time taken by the processor to determine whether the word is in level 1 or level 2.

For a high percentage of level 1 accesses, the average access time is much closer to that of level 1 than to that of level 2.

Suppose we have 95% of the memory access found in Level 1

Ts = 0.95 * 0.01 µs + (1 - 0.95) * (0.01 µs + 0.1 µs)

Ts= 0.015 µs
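A minimal Python sketch (illustrative) of the two-level average access time formula used above:

    def avg_access_time(hit_ratio, t1, t2):
        # Ts = H*T1 + (1 - H)*(T1 + T2); times here are in microseconds
        return hit_ratio * t1 + (1 - hit_ratio) * (t1 + t2)

    print(avg_access_time(0.95, 0.01, 0.1))   # -> 0.015 us, as in the example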

Locality of reference

Also known as the principle of locality, the phenomenon in which the same values or related storage

locations are frequently accessed.

Two basic types of reference locality

1. Temporal coherence

There is a higher probability of repeated access to any data item that has been accessed

in the recent past.

Ex: for loop

2. Spatial coherence

There is a higher probability of access to any data item that is physically close to any other data item that has been accessed in the recent past.

Ex: arrays

Physical Characteristics

Figure 5: Memory Hierarchy list and how physical characteristics differ accordingly.


Semiconductor Memory

The basic element is the cell.

A cell can be in one of two stable states (representing 1 and 0), and it supports two operations:

1. Read (sense the state)

2. Write (set the state)

Random Access Memory (RAM)

Dynamic RAM (DRAM):
    Bits stored as charge in capacitors
    Charges leak, so it needs refreshing even when powered (refresh circuits required)
    Simpler construction, smaller per bit, less expensive
    Slower
    Ex: Main Memory

Static RAM (SRAM):
    Bits stored as on/off switches (uses 6 transistors per bit)
    No charges to leak, so no refreshing is needed while powered
    More complex construction, larger per bit, more expensive
    Faster
    Ex: Cache

It is possible to build a computer which uses only SRAM. But there are problems

This would be very fast

This would need no cache

This would cost a very large amount

DRAM Organization in details

There are many ways that a DRAM (Main memory) could be organized.

Ex02:

List few ways how a 16Mbit DRAM can be organized.

16 chips of 1Mbit cells in parallel, so that 1bit of each word in 1chip. i.e. word size is 16bit

=> 1M x 16

4 chips of 4Mbit cells in parallel, so that 1bit of each word in 1chip. i.e. word size is 4bit

=> 4M x 4

Typical 16Mbit DRAM (4M x 4):

2048 x 2048 x 4bit array


Cache memory

Take a bunch of main memory blocks asked for by the CPU and make a copy of them available to the CPU in a faster manner. If a requested address is already available within the cache, it is a "hit".

What happens when the CPU requests a main memory address?

If the address is available in cache:
    the content at that address is presented to the CPU
Else:
    search for the address in main memory
    If the cache has enough space for the new block:
        the new block is stored in the cache
    Else:
        an existing block in the cache is replaced by the new block
    the content is presented to the CPU

Figure 6: Performance of accesses involving only

𝐴𝑣𝑒𝑟𝑎𝑔𝑒 𝑎𝑐𝑐𝑒𝑠𝑠 𝑡𝑖𝑚𝑒 (𝑇𝑠) = 𝐻 ∗ 𝑇1 + (1 − 𝐻) ∗ (𝑇1 + 𝑇2)

When H ->1, the Average access time(Ts) = T1

When H ->0, the Average access time(Ts) = T1 + T2

Cache:

Small amount of fast memory

Sits between normal main memory and CPU

May be located on CPU chip or module


Figure 7: Cache memory unit of transfer

Overview of the Cache Design:

Size

o Cost

More cache is expensive

o Speed

More cache is fast, but up to a point only

Checking cache for main memory addresses takes time

Mapping Function

o Direct mapping

o Associative mapping

o Set associative mapping

Replacement Algorithms

o LRU

o FIFO

o LFU

o Random

Write Policy

o Write through

o Write back

Block size

Number of caches


Mapping function

Figure 8: CPU, Cache, Cache lines and Main Memory

Cache line:

Each and every individual block in Cache memory is directly connected to CPU without any

barriers. CPU accesses these blocks using Cache lines.

Mapping:

Size(Block of Main Memory) = Size(Block of Cache Memory)

Which Main Memory block maps to which Cache Memory block?

1. Direct Mapping

Each block of main memory maps to only one cache line.

Example:

Figure 9: A system with 64KB cache and 16 MB Memory

Assume the block size is 4 words, 1 byte per word, so the size of a block is 4 bytes.

Size of Cache memory = 64 KB
Number of blocks in the Cache memory = 64 KB / 4 B = 16 K


Number of Cache lines = Number of blocks in Cache = 16 K = 2^4 x 2^10 = 2^14

Therefore we need 14 bits to identify a Cache line (Cache memory block).
Address size of a Cache memory block = 14 bits

Size of Main memory = 16 MB
Size of a Main memory word = 1 B
Number of addressable Main Memory words = 16 MB / 1 B = 16 M = 2^4 x 2^10 x 2^10 = 2^24

Therefore we need 24 bits to identify a Main memory byte or word.

Since addresses are divided into groups of 4 words,
Number of blocks in the Main memory = 16 MB / 4 B = 4 M = 2^2 x 2^10 x 2^10 = 2^22

Therefore we need 22 bits to identify a Main memory block.
Size of the block number in a Main memory address = 22 bits

The combinations of the remaining 2 bits are used to identify the 4 words belonging to a given Main memory block.

Figure 10: Main Memory Address and its three main components

Cache line number is also equal to the cache block number.


Graphical approach

*note

Green Colours represent cache lines.

Blue colours represent Tags.

Blue + Green represent the Main Memory Block number.

Figure 11: Direct Mapping

When the CPU asks for main memory address 000000010000000000000010:

First, the cache line with line number 00000000000000 is checked.

If that line is not empty (it has something in it), the tag of the block held in the cache line is compared with the tag 00000001 of the requested address.
    If it matches, the word selected by the last two bits of the address (10) is returned;
    else, the current block is replaced by the required Main memory block.
Else, if the line is empty, the required Main Memory block is loaded into the Cache.

Exercise:

Find the cache line and tag of the following Main Memory address with all the above assumptions and

conditions.

000010010011000001001011

Answer

Cache line number: 00110000010010 Tag: 00001001
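A small Python sketch (illustrative, using the 8-bit tag / 14-bit line / 2-bit word split assumed above) that reproduces the exercise answer:

    def split_direct_mapped(addr, tag_bits=8, line_bits=14, word_bits=2):
        # Split a 24-bit main memory address string into tag, cache line, and word fields
        assert len(addr) == tag_bits + line_bits + word_bits
        return addr[:tag_bits], addr[tag_bits:tag_bits + line_bits], addr[-word_bits:]

    tag, line, word = split_direct_mapped("000010010011000001001011")
    print(tag, line, word)   # -> 00001001 00110000010010 11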


Figure 12: Direct Mapping Process

Mathematical Approach

i = j mod m

Where i = cache line number, j = main memory block number and m = number of cache lines

Figure 13: Direct mapping function


Main memory blocks map to cache blocks sequentially, but after block 7 there is no cache block left for block 8, so mapping starts again from block 0 of the cache.

Likewise, a particular block j in Main Memory maps to block (j mod m) in cache, where m is the number of blocks in cache.

Direct mapping Cache line table:

m = number of memory blocks (lines) in cache
S = number of bits used to identify the main memory block number

Main Memory block (j)                    Cache line (i)
0, m, 2m, 3m, ..., 2^S - m               0
1, m+1, 2m+1, 3m+1, ..., 2^S - m + 1     1
...                                      ...
m-1, 2m-1, 3m-1, ..., 2^S - 1            m-1

*note

Use of a portion of the address as line number provides a unique mapping.

When more than one memory block maps to the same cache line, it is necessary to distinguish them using the tag.

Pros and Cons of Direct Mapping

Pros:

Simple

Inexpensive

Cons:

One fixed location for a given block.

o If a program repeatedly accesses two blocks that map to the same line, cache misses are very high.

o This leads to thrashing.


2. Associative Mapping

A main memory block can be loaded into any cache block that is available.

There are two parts in a Main Memory address when we consider Associative mapping.

If we take the same example of 64KB cache and 16MB Main Memory, the address will be like follows.

Figure 14: Two main Components of a Main Memory address

*note

In Associative mapping, a main memory block can be loaded into any cache block. Therefore the whole main memory block number is used as the tag.

Every cache line's tag is examined for a match.

Cache searching is expensive.

Figure 15: Associative Mapping Process


Pros and Cons of Associative Mapping

Pros:

Any main memory block can be mapped to any cache memory block

Less swapping in temporal and spatial coherence (no thrashing)

Cons:

Have to search all the cache lines for the tag of the requested Main memory address

3. Set Associative Mapping

A combination of Direct mapping and Associative mapping.

Cache is divided into a number of sets.

Each set contains a number of cache blocks/ cache lines.

A given block maps to any line within the particular set to which that block is mapped.

Example: 2 way associative mapping

Two lines per set

A given block can be in one of 2 lines in the set which that block belongs to

Figure 16: Structure of a Cache memory with sets

Suppose there are m number of cache blocks in the cache memory.

m = v x k

v = number of sets within the cache

k = number of lines (vacancies or cache blocks) within a set


Every Block in main memory maps to one particular set in the cache.

Within that set there are a number of vacancies available.

The main memory block can be mapped to any vacant block within that particular set.

Replacement mechanisms are needed if that particular set is full, otherwise no.

Mapping a Main Memory Block to a set

Suppose i is the set number to which a given main memory block maps.

i = j mod v

j = main memory block number
v = number of sets available within the cache

Accordingly, main memory blocks 0 to (v-1) map to sets 0 to (v-1) respectively; the v-th main memory block maps to set 0 again, and so on.

If we have v sets, let v = 2^d.

Now d is the number of bits used to represent the set number.

Figure 17: Components of a Main Memory Address in Set Associative Mapping

If the tag of the required main memory address is found within the particular set, the word is returned to the CPU.

Two different blocks that map to the same set never have identical tags; therefore the tag is unique within the set.


If we take the same example of a 64 KB cache and 16 MB Main Memory for 2-way set associative mapping, the address can be divided into 3 parts as follows.

Assume the block size is 4 words, 1 byte per word, so the size of a block is 4 bytes.

Size of Cache memory = 64 KB
Number of blocks in the Cache memory = 64 KB / 4 B = 16 K
Number of Cache lines = Number of blocks in Cache = 16 K

Since 2-way set associative mapping is considered, a set contains 2 lines (2 cache blocks).

Number of sets in the cache = v = 16 K / 2 = 8 K = 2^13

Now it is in the correct form v = 2^d.

Therefore we need 13 bits to represent the set number to which a Main memory address belongs.

The remaining 9 bits of the main memory block number are taken as the tag that identifies a particular main memory block uniquely within the set.

Figure 18: Three main components of Main memory address
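A short Python sketch (illustrative) of the 9-bit tag / 13-bit set / 2-bit word split assumed above:

    def split_set_associative(addr, tag_bits=9, set_bits=13, word_bits=2):
        # Split a 24-bit main memory address string into tag, set number, and word fields
        assert len(addr) == tag_bits + set_bits + word_bits
        return addr[:tag_bits], addr[tag_bits:tag_bits + set_bits], addr[-word_bits:]

    tag, set_no, word = split_set_associative("000010010011000001001011")
    print(tag, set_no, word)   # -> 000010010 0110000010010 11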


Cache Replacement Algorithms

There is the possibility of the mapped cache memory becoming fully occupied. At such an instance, an existing block is removed from the cache and the new block is loaded in its place.

Replacement depends on the mapping mechanism.

For each mapping mechanism, when is replacement needed and how is it done?

Direct Mapping: replacement is needed if the mapped cache line is already occupied. No choice: that particular block has to be replaced.

Associative Mapping: replacement is needed if all the cache blocks are full. A hardware-implemented algorithm (fast) is used: Least Recently Used (LRU), First In First Out (FIFO), Least Frequently Used (LFU), or Random.

Set Associative Mapping: replacement is needed if the mapped set is full. The same hardware-implemented algorithms (LRU, FIFO, LFU, Random) are used within the set.

Least Recently Used (LRU)

Replace the block in the set that has been in the cache longest with no reference to it. For two-way set associative mapping, this is easily implemented: each cache line includes a USE bit. When a line is referenced, its USE bit is set to 1 and the USE bit of the other line in that set is set to 0. When a block is to be read into the set, the line whose USE bit is 0 is used. Because we are assuming that more recently used memory locations are more likely to be referenced, LRU should give the best hit ratio.

LRU is also relatively easy to implement for a fully associative cache. The cache mechanism maintains a

separate list of indexes to all the lines in the cache. When a line is referenced, it moves to the front of

the list. For replacement, the line at the back of the list is used. Because of its simplicity of

implementation, LRU is the most popular replacement algorithm.

First In First Out (FIFO)

Replace that block in the set that has been in the cache the longest. FIFO is easily implemented as a

round-robin or circular buffer technique.

Least Frequently Used (LFU)

Replace that block in the set that has experienced the fewest references. LFU could be implemented by

associating a counter with each line.

Random

A technique not based on usage (i.e., not LRU, LFU, FIFO, or some variant) is to pick a line at random

from among the candidate lines. Simulation studies have shown that random replacement provides only

slightly inferior performance to an algorithm based on usage.


Write Policy

When a block that is in the cache is to be replaced, there are 2 cases to consider:

1. If the old block in the cache has not been modified, then overwriting can be done without any

issue.

2. If the old block in the cache has been modified, then main memory must be updated by writing

the line of cache out to the block of main memory before bringing the new block to that place.

There are 2 problems related to writing back to main memory:

1. More than one device may have access to main memory.

Ex: An I/O module may be able to read-write directly to memory. If a word has been

altered only in the cache, then the corresponding memory word is invalid. If the I/O

device has altered main memory, then the cache word is invalid.

2. Multiple processors are attached to the same bus and each processor has its own local cache.

If a word is altered in one cache, it could conceivably invalidate a word in other

caches.

There are 2 techniques for Write Policy:

1. Write through policy

2. Write back policy

Write through policy

All write operations are made to main memory as well as to the cache, ensuring that main

memory is always valid.

Any other processor-cache module can monitor traffic to main memory to maintain consistency

within its own cache.

The main disadvantage of this technique is that it generates substantial memory traffic and may

create a bottleneck. Overall performance will go down this way.

Write back policy

In this technique updates are made only in the cache.

When an update occurs, a dirty bit, or use bit, associated with the line is set. Then, when a block

is replaced, it is written back to main memory if and only if the dirty bit is set.

The problem with write back policy is that portions of main memory are invalid, and hence

accesses by I/O modules can be allowed only through the cache. This makes for complex

circuitry and a potential bottleneck.

Sir did not talk about cache coherency


Line Size (Block Size)

As the block size increases from very small to larger sizes, the hit ratio will at first increase because of the principle of locality. The hit ratio will begin to decrease as the block becomes even bigger.

Two specific effects come into play when block sizes get larger:

Larger blocks reduce the number of blocks that fit into the cache

Some additional words are farther from the requested word and therefore less likely to be

needed in near future

Number of caches

When caches were originally introduced, systems used only one cache. More recently, the use of multiple caches has become the norm.

There are two aspects to this design issue:

1. The number of cache levels

2. The use of unified vs split caches

Cache Performance

Cache has an important effect on the overall system performance.

Execution time = (CPU execution cycles + Memory stall cycles) x Clock cycle time

Memory stall cycles = Instructions in the program x misses per instruction x Miss penalty

As CPU increases in performance, the memory stall cycles have an increasing effect on the overall

performance.
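A minimal Python sketch (illustrative values, not from the notes) of the two formulas above:

    def execution_time(cpu_cycles, instructions, misses_per_instr, miss_penalty, cycle_time_s):
        # Execution time = (CPU execution cycles + memory stall cycles) * clock cycle time
        stall_cycles = instructions * misses_per_instr * miss_penalty
        return (cpu_cycles + stall_cycles) * cycle_time_s

    # e.g. 2e9 CPU cycles, 1e9 instructions, 2% misses per instruction, 100-cycle penalty, 1 ns cycle
    print(execution_time(2e9, 1e9, 0.02, 100, 1e-9))   # -> 4.0 seconds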

How to reduce the memory stall time:

Reduce miss rate (better cache strategies)

o Multilevel cache with on chip small cache (very fast), possibly set associative, and large

off chip cache, probably direct mapped

Reduce the miss penalty (fast memory)

o Increase bandwidth to main memory (wider bus)

*note

Read Pentium IV cache Organization.


Lesson 04 – Virtual Memory

If the system uses 24-bit addresses, the addressable number of units equals 2^24. How can we have a larger address space than that? This is the LIMITATION.

VM is a concept that emerged to overcome the space limitation in Main Memory.

VM is a technique that allows the execution of processes which are not completely available in memory.

The main advantage of this scheme is that programs can be larger than physical memory. VM is the

separation of logical memory from physical memory.

This separation allows an extremely large virtual memory to be provided for programmers when only a

smaller physical memory is available. Following are the situations, when entire program is not required

to be loaded fully in Main memory.

User written error handling routines are used only when an error occurred in the data or

computation.

Certain options and features of a program may be used rarely.

Many tables are assigned a fixed amount of address space even though only a small amount of

the table is actually used.

The ability to execute a program that is only partially in memory would confer many benefits:

Less I/O would be needed to load or swap each user program into main memory.

A program would no longer be constrained by the amount of physical memory that is available.

Each user program could take less physical memory; more programs could be run at the same time, with a corresponding increase in CPU utilization and throughput.

Since VM makes the Main Memory (MM) appear much larger than its actual size, programmers can think that they have unlimited memory space.

Figure 19: Virtual Memory concept


VM terminology

Page:

o equivalent of “block” fixed size

Page faults:

o equivalent of “misses”

Virtual address:

o equivalent of “tag”

No cache index equivalent: VM is fully associative. A VM table index appears because VM uses a different (page table) implementation of full associativity.

Physical address:

o translated value of virtual address, can be smaller than virtual address, no equivalent in

caches

Memory mapping (address translation):

o converting virtual to physical addresses, no equivalent in caches

Valid bit:

o Same as in caches

Referenced bit:

o Used to approximate LRU algorithm

Dirty bit:

o Used to optimize write-back

VM

VM fits lots of programs and program data into the actual MM.

Every program has its own virtual address space starting from zero. Each program maintains a separate table called the page table, which can be uniquely identified by the Process ID (PID). It maps VM addresses to cache, MM, and secondary storage addresses. There is another table called the Translation Lookaside Buffer (TLB), which keeps the most recently used page numbers; it is a fast semiconductor memory. TLB entries are also identified uniquely by the Process ID (PID). Each program feels that only that particular process is running on the CPU.

Figure 20: Virtual address


Figure 21: Virtual address space for the program which has memory blocks of A, B, C, and D

In this manner the program size does not need to be known beforehand, and the program size can change dynamically.

Goals of VM:

Illusion of having more physical memory

Program relocation support (relieves programmer burden)

Protection: one program cannot read/write the data of another

Since this is an indirect mechanism it adds some delay, but the overall performance increases significantly.

Virtual memory implementation techniques:

1. Paged

2. Segmentation

3. combined

Paged implementation:

Overall program resides on larger memory

Address space divided into virtual pages with equal size

MM divided into page frames of same size as pages in low level memory

Map virtual page to physical page by using page table

TLB is used to keep recently used page numbers


Segmented implementation:

Program is not viewed as a single sequence of instruction and data

Arranged into several modules of code, data, and stacks

Each module called segment – segment sector

Different sizes

Associated with segment registers

o Ex: Stack, Data, Program segment registers

Figure 22: Paging vs Segmentation

*note

A scheme that allows the use of variable size segments can be useful from a programmer's point of

view, since it lends itself to the creation of modular programs, but the operating system now not only

has to keep track of the starting address of each segment, but since they are variable in size, must also

calculate the offset to the end of each segment. Some systems combine paging and segmentation by

implementing segments as variable-size blocks composed of fixed-size pages.

VM design issues:

Miss penalty huge: Access time of disk = millions of cycles

o Highest priority to minimize page faults

o Use write back policy instead of write through. This is called copy-back in VM. For

optimization purposes it uses dirty bit to clarify whether that page is modified and has

to be copied back.

o If there is a page fault, OS schedules another process.

Protection support

o Break up program’s code and data into pages. Add process ID to cache index; use

separate tables for different programs

o OS is called via an exception: handles page faults


How a particular virtual address is mapped to the physical memory address:

Figure 23: Virtual address mapping to physical address

When the CPU asks for a certain virtual address of a process, the virtual page number is extracted and first looked up in the TLB. If it is found there, the content is presented to the CPU. Otherwise (the page has not been used recently), the page table is consulted next. If the entry is found and valid, the content is presented to the CPU; if it is invalid in the page table, i.e. the page is not in MM at all, then the page is brought from secondary memory into MM and the content is then presented to the CPU.
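A toy Python sketch (purely illustrative data structures and helper names, not the actual OS mechanism) of the TLB -> page table -> secondary memory lookup order described above:

    def load_page_from_disk(vpage):
        # Hypothetical stand-in for the OS page-fault handler bringing the page into MM
        return ("frame_for", vpage)

    def translate(vpage, tlb, page_table):
        # Follow the lookup order described above: TLB -> page table -> secondary memory
        if vpage in tlb:                        # TLB hit: page number used recently
            return tlb[vpage]
        if vpage in page_table:                 # page resident in main memory
            tlb[vpage] = page_table[vpage]
            return tlb[vpage]
        frame = load_page_from_disk(vpage)      # page fault: bring the page into MM first
        page_table[vpage] = frame
        tlb[vpage] = frame
        return frame

    tlb, page_table = {}, {7: "frame 3"}
    print(translate(7, tlb, page_table))        # -> frame 3 (found via page table, then cached in TLB)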

Figure 24: CPU->TLB->Page table


Figure 25: TLB and caches, action hierarchy


Lesson 04 – Register Transfer Language and Micro-Operations