Chapter 1

CPE 408340 Computer Organization

Chapter 1: Computer Abstractions and Technology

Sa’ed R. Abed [Computer Engineering Department, Hashemite University]

[Adapted from Otmane Ait Mohamed Slides & Computer Organization and Design, Patterson & Hennessy, © 2005, UCB]

Course Administration

Instructor: Sa’ed Rasmi Abed

Instructor's e-mail: [email protected]

Office Hours: Mon, Wed: 9:00 - 10:00 or by appointment

Lecture Time: Mon, Wed: 12:30 - 2:00

Text:
Required: Computer Organization and Design, 4th Edition, Patterson and Hennessy, © 2008
Optional: Computer Organization and Architecture: Designing for Performance, 7th Edition, William Stallings, Prentice Hall, July 2005

Slides: PDF on the course web page (Moodle System)


Course Content

Content

Principles of computer architecture: CPU datapath and control unit design (single-issue pipelined, superscalar, VLIW), memory hierarchies and design, I/O organization and design, advanced processor design (multiprocessors).

Course goals

To learn the organizational paradigms that determine the capabilities and performance of computer systems. To understand the interactions between the computer’s architecture and its software so that future software designers (compiler writers, operating system designers, database programmers, …) can achieve the best cost-performance trade-offs and so that future architects understand the effects of their design choices on software applications.

Course prerequisites

CPE 408330: Assembly Language and Microprocessor Systems.

What You Should Know

Basic logic design & machine organization:

logical minimization, FSMs, component design
processor, memory, I/O.

To learn the organizational paradigms that determine the capabilities and performance of computer systems.

Create, assemble, run, debug programs in an assembly language:

MIPS preferred.

To explore the memory hierarchy system and how to interface it to a computer.

Course Structure

Design focused class:

• Various homework assignments throughout the semester

Lectures:

• Computer Abstractions and Technology - Chapter 1 (2 Weeks) (Sec. 1.1 to 1.4)
• Instructions: Language of the Computer - Chapter 2 (2 1/2 Weeks) (Sec. 2.1 to 2.7 & 2.10)
• Arithmetic for Computers - Chapter 3 (1 1/2 Weeks) (Sec. 3.1 to 3.4)
• Review and First Exam (1/2 Week)
• The Processor - Chapter 4 (4 Weeks) (Sec. 4.1 to 4.9)
• Review and Second Exam (1/2 Week)
• Exploiting Memory Hierarchy - Chapter 5 (3 Weeks) (Sec. 5.1 to 5.3 & 5.5)
• Review and Final Exam (1/2 Week)

Grading Information

Grade determinants

• First Exam ~20%: Monday, March 12th.

• Second Exam ~25%: Monday, April 16th.

• Final Exam ~50%: TBD

• Class participation & pop quizzes ~5%

Let me know about any exam conflicts ASAP

Ethics and Professionalism

Ethics

Disciplined dealing with moral duty.
Moral Principles or Practice.
System of right behavior.

Professionalism

The conduct, aims or qualities that characterize a professional person.

What characterizes a “professional”?

a professional accepts responsibility fully – does not blame others for failure.

a professional is reliable - gets the job done on time.

a professional is competent - gets the correct answer.

a professional works independently – finds out what he/she does not know.

a professional follows up on all the details.

a professional has high standards of ethical behavior – does not lie or cheat.

a professional does not steal the work of others and present it as his own.


What characterizes a “professional”?

a professional is respectful to others.

a professional does not offer excuses in lieu of completed work.

a professional is resourceful.

a professional has initiative.

a professional succeeds in spite of obstacles and road blocks.

a professional has justifiable self-confidence.


The Student is the Product of our Engineering School

We are an accredited engineering school: our product is engineering professionals.

Employers expect our graduates to behave like professionals.

Employers seek the qualities of a professional in job interviews.

Professionalism must start in the first semester and be part of every course over four years.


The Student is the Product of our Engineering School

Every student must learn to “think like an engineer”:
o accept responsibility for his/her own learning
o follow up on lecture material and homework
o learn problem-solving skills, not just how to solve each specific homework problem
o build a body of knowledge integrated over four years of courses

We all want HU’s excellent reputation to be reinforced so that employers will hire our graduates!

By the architecture of a system, I mean the complete and detailed specification of the user interface. … As Blaauw has said, “Where architecture tells what happens, implementation tells how it is made to happen.”

The Mythical Man-Month, Brooks, pg 45

Moore’s Law

In 1965, Gordon Moore predicted that the number of transistors that can be integrated on a die would double every 18 to 24 months (i.e., grow exponentially with time).

Amazingly visionary – million transistor/chip barrier was crossed in the 1980’s.

2300 transistors, 1 MHz clock (Intel 4004) - 1971
16 Million transistors (Ultra Sparc III)
42 Million transistors, 2 GHz clock (Intel Xeon) - 2001
55 Million transistors, 3 GHz, 130 nm technology, 250 mm² die (Intel Pentium 4) - 2004
140 Million transistors (HP PA-8500)
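As a rough check of the prediction against the data points listed on this slide, the sketch below computes the doubling period implied by two of them (the Intel 4004 in 1971 and the Intel Xeon in 2001). The function name is an illustrative choice, not something from the slides, and the result depends entirely on which two chips are compared.

```python
import math


def doubling_time_years(count_start, year_start, count_end, year_end):
    """Years per doubling implied by two (year, transistor count) data points."""
    doublings = math.log2(count_end / count_start)
    return (year_end - year_start) / doublings


# Data points taken from the slide: Intel 4004 (1971) and Intel Xeon (2001).
years = doubling_time_years(2300, 1971, 42e6, 2001)
print(f"Implied doubling time: {years:.2f} years ({years * 12:.0f} months)")
```

For this pair of chips the implied period comes out a little over two years, close to the upper end of the 18 to 24 month range quoted above.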

Where is the Market?

[Figure: bar chart of millions of computers sold per year, 1998 to 2002, by category: Embedded, Desktop, Servers. Data labels shown on the chart: 290, 933, 488, 1143, 892, 1354, 862, 1294, 1122, 1315]

Processor Performance Increase

[Figure: log-scale plot of performance (SPECint) versus year, 1987 to 2003. Data points: SUN-4/260, MIPS M/120, MIPS M2000, IBM RS6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600, DEC Alpha 21264A/667, Intel Xeon/2000, Intel Pentium 4/3000]

Growth Capacity of DRAM Chips

K = 1024 (2^10)

In recent years the growth rate has slowed to 2x every 2 years

The Evolution of Computer Hardware

When was the first transistor invented?

Modern-day electronics began with the invention in 1947 of the transfer resistor - the bi-polar transistor - by Bardeen et al. at Bell Laboratories

The Evolution of Computer Hardware

When was the first IC (integrated circuit) invented?

In 1958 the IC was “born” when Jack Kilby at Texas Instruments successfully interconnected, by hand, several transistors, resistors and capacitors on a single substrate


The Underlying Technologies

Year   Technology                     Relative Perform/Unit Cost
1951   Vacuum Tube                    1
1965   Transistor                     35
1975   Integrated Circuit (IC)        900
1995   Very Large Scale IC (VLSI)     2,400,000
2005   Submicron VLSI                 6,200,000,000

What if technology in the transportation industry advanced at the same rate?

The PowerPC 750

Introduced in 1999

3.65M transistors

366 MHz clock rate

40 mm² die size

250 nm (0.25 micron) technology

Technology Outlook

High Volume Manufacturing (year): 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018

Technology Node (nm): 90, 65, 45, 32, 22, 16, 11, 8

Integration Capacity (BT): 2, 4, 8, 16, 32, 64, 128, 256

Delay = CV/I scaling: 0.7, ~0.7, >0.7 (delay scaling will slow down)

Energy/Logic Op scaling: >0.35, >0.5, >0.5 (energy scaling will slow down)

Bulk Planar CMOS: High Probability, Low Probability

Alternate, 3G etc: Low Probability, High Probability

Variability: Medium, High, Very High

ILD (K): ~3, <3 (reduce slowly towards 2 to 2.5)

RC Delay: 1, 1, 1, 1, 1, 1, 1, 1

Metal Layers: 6-7, 7-8, 8-9 (0.5 to 1 layer per generation)

Impacts of Advancing Technology

Processor
logic capacity: increases about 30% per year
performance: 2x every 1.5 years

Memory
DRAM capacity: 4x every 3 years, now 2x every 2 years
memory speed: 1.5x every 10 years
cost per bit: decreases about 25% per year

Disk
capacity: increases about 60% per year

ClockCycle = 1/ClockRate

500 MHz ClockRate = 2 nsec ClockCycle
1 GHz ClockRate = 1 nsec ClockCycle
4 GHz ClockRate = 250 psec ClockCycle
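The ClockCycle = 1/ClockRate relation can be checked with a few lines of Python. This is only a sketch; the function name is an illustrative choice, and the three rates are the ones quoted on the slide.

```python
def clock_cycle_seconds(clock_rate_hz):
    """Clock cycle time is the reciprocal of the clock rate."""
    return 1.0 / clock_rate_hz


# The three clock rates quoted on the slide.
for rate_hz in (500e6, 1e9, 4e9):
    cycle_ps = clock_cycle_seconds(rate_hz) * 1e12  # seconds -> picoseconds
    print(f"{rate_hz / 1e9:g} GHz -> {cycle_ps:g} ps")
```

Expressed in picoseconds, the 2 nsec and 1 nsec entries correspond to 2000 ps and 1000 ps.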

Computer Organization and Design

This course is all about how computers work

But what do we mean by a computer?

Different types: embedded, laptop, desktop, server

Different uses: automobiles, graphics, finance, genomics, …

Different manufacturers: Intel, AMD, IBM, HP, Apple, Sony, Sun, …

Different underlying technologies and different costs!

Best way to learn:

Focus on a specific instance and learn how it works

While learning general principles and historical perspectives

Example Machine Organization

Workstation design target

25% of cost on processor
25% of cost on memory (minimum memory size)
Rest on I/O devices, power supplies, box

[Figure: block diagram of a Computer: CPU (Control, Datapath), Memory, Devices (Input, Output)]

Embedded Computers in Your Car

Why Learn this Stuff?

You want to call yourself a “computer scientist/engineer”

You want to build HW/SW people use (so you need to deliver performance at low cost)

You need to make a purchasing decision or offer “expert” advice

Both hardware and software affect performance

The algorithm determines number of source-level statements

The language/compiler/architecture determine the number of machine-level instructions

- (Chapters 1, 2 and 3)

The processor/memory determine how fast machine-level instructions are executed

- (Chapters 4 and 5)

What is a Computer?

Components:

processor (datapath, control)
input (mouse, keyboard)
output (display, printer)
memory (cache (SRAM), main memory (DRAM), disk drive, CD/DVD)
network

Our primary focus: the processor (datapath and control)

Implemented using millions of transistors
Impossible to understand by looking at each transistor
We need abstraction!

Major Components of a Computer

PC Motherboard Closeup

Inside the Pentium 4 Processor Chip

Below the Program

High-level language program (in C)

    void swap(int v[], int k)
    {
        int temp;          /* holds v[k] while the two elements are exchanged */
        temp = v[k];
        v[k] = v[k+1];
        v[k+1] = temp;
    }

C compiler (one-to-many)

Assembly language program (for MIPS)

    swap: sll  $2, $5, 2
          add  $2, $4, $2
          lw   $15, 0($2)
          lw   $16, 4($2)
          sw   $16, 0($2)
          sw   $15, 4($2)
          jr   $31

assembler (one-to-one)

Machine (object) code (for MIPS)

    000000 00000 00101 0001000010000000
    000000 00100 00010 0001000000100000
    . . .
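To illustrate the one-to-one step performed by the assembler, the sketch below packs the fields of the first two assembly instructions into 32-bit words and prints them in the same grouping as the slide. It assumes the standard MIPS R-type layout (6-bit opcode, three 5-bit register fields, 5-bit shift amount, 6-bit function code) and the standard function codes for `sll` and `add`; the helper names are illustrative.

```python
FUNCT_SLL = 0b000000  # function code for sll
FUNCT_ADD = 0b100000  # function code for add


def encode_r_type(rs, rt, rd, shamt, funct, opcode=0):
    """Pack R-type fields into one 32-bit MIPS instruction word."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def slide_format(word):
    """Render a word as on the slide: opcode, rs, rt, then the low 16 bits."""
    bits = f"{word:032b}"
    return f"{bits[0:6]} {bits[6:11]} {bits[11:16]} {bits[16:]}"


# sll $2, $5, 2  -> rd = 2, rt = 5, shamt = 2 (rs is unused and left at 0)
sll_word = encode_r_type(rs=0, rt=5, rd=2, shamt=2, funct=FUNCT_SLL)
# add $2, $4, $2 -> rd = 2, rs = 4, rt = 2
add_word = encode_r_type(rs=4, rt=2, rd=2, shamt=0, funct=FUNCT_ADD)

print(slide_format(sll_word))
print(slide_format(add_word))
```

The two printed lines should match the first two machine-code words on the slide.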

Advantages of Higher-Level Languages?

Higher-level languages

Allow the programmer to think in a more natural language and for their intended use (Fortran for scientific computation, Cobol for business programming, Lisp for symbol manipulation, Java for web programming, …)

Improve programmer productivity – more understandable code that is easier to debug and validate

Improve program maintainability

Allow programs to be independent of the computer on which they are developed (compilers and assemblers can translate high-level language programs to the binary instructions of any machine)

Emergence of optimizing compilers that produce very efficient assembly code optimized for the target machine

As a result, very little programming is done today at the assembler level

Machine Organization

Capabilities and performance characteristics of the principal Functional Units (FUs)

e.g., register file, ALU, multiplexors, memories, ...

The ways those FUs are interconnected

e.g., buses

Logic and means by which information flow between FUs is controlled

The machine’s Instruction Set Architecture (ISA)

Register Transfer Level (RTL) machine description

Instruction Set Architecture (ISA)

ISA: An abstract interface between the hardware and the lowest level software of a machine that encompasses all the information necessary to write a machine language program that will run correctly, including instructions, registers, memory access, I/O, and so on.

“... the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.”
– Amdahl, Blaauw, and Brooks, 1964

Enables implementations of varying cost and performance to run identical software

ABI (application binary interface): The user portion of the instruction set plus the operating system interfaces used by application programmers. Defines a standard for binary portability across computers.

ISA Type Sales

[Figure: bar chart of millions of processors sold per year, 1998 to 2002, by ISA type: Other, SPARC, Hitachi SH, PowerPC, Motorola 68K, MIPS, IA-32, ARM]

PowerPoint “comic” bar chart with approximate values (see text for correct values)

Major Components of a Computer

[Figure: block diagram with Processor (Control, Datapath), Memory, Devices (Input, Output) and Network]

Below the Program

High-level language program (in C)

    swap (int v[], int k)
    . . .

C compiler

Assembly language program (for MIPS)

    swap: sll  $2, $5, 2
          add  $2, $4, $2
          lw   $15, 0($2)
          lw   $16, 4($2)
          sw   $16, 0($2)
          sw   $15, 4($2)
          jr   $31

assembler

Machine (object) code (for MIPS)

    000000 00000 00101 0001000010000000
    000000 00100 00010 0001000000100000
    100011 00010 01111 0000000000000000
    100011 00010 10000 0000000000000100
    101011 00010 10000 0000000000000000
    101011 00010 01111 0000000000000100
    000000 11111 00000 0000000000001000

Input Device Inputs Object Code

[Figure: block diagram with Processor (Control, Datapath), Memory, Devices (Input, Output) and Network; the seven words of swap object code arrive through the input device]

Object Code Stored in Memory

[Figure: block diagram with Processor (Control, Datapath), Memory, Devices (Input, Output) and Network; Memory holds the seven words of swap object code]

Processor Fetches an Instruction

[Figure: block diagram with Processor (Control, Datapath), Memory, Devices (Input, Output) and Network; Memory holds the seven words of swap object code]

Processor fetches an instruction from memory

Control Decodes the Instruction

[Figure: block diagram with Processor (Control, Datapath), Memory, Devices (Input, Output) and Network; the fetched instruction is 000000 00100 00010 0001000000100000]

Control decodes the instruction to determine what to execute

Datapath Executes the Instruction

[Figure: block diagram with Processor (Control, Datapath), Memory, Devices (Input, Output) and Network; the instruction being executed is 000000 00100 00010 0001000000100000]

Datapath executes the instruction as directed by control

contents Reg #4 ADD contents Reg #2, results put in Reg #2

What Happens Next?

[Figure: block diagram with Processor (Control, Datapath), Memory, Devices (Input, Output) and Network; Memory holds the seven words of swap object code, and the processor cycles through Fetch, Decode, Exec]

Processor fetches the next instruction from memory

How does it know which location in memory to fetch from next?
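One way to picture the fetch, decode, execute cycle is a tiny interpreter for the seven object-code words of swap. The sketch below is an illustration under stated assumptions, not a model of real hardware: it keeps a program counter (PC) that advances by 4 after each fetch unless `jr` overwrites it, it handles only the five instructions that appear in swap, it ignores the overflow trap of `add`, and the array address, array contents and return address are made-up values.

```python
# The seven object-code words of swap, copied from the slides.
SWAP_OBJECT_CODE = [
    "000000 00000 00101 0001000010000000",  # sll $2, $5, 2
    "000000 00100 00010 0001000000100000",  # add $2, $4, $2
    "100011 00010 01111 0000000000000000",  # lw  $15, 0($2)
    "100011 00010 10000 0000000000000100",  # lw  $16, 4($2)
    "101011 00010 10000 0000000000000000",  # sw  $16, 0($2)
    "101011 00010 01111 0000000000000100",  # sw  $15, 4($2)
    "000000 11111 00000 0000000000001000",  # jr  $31
]


def run(words, registers, data_memory, start_pc=0):
    """Fetch, decode and execute until the PC leaves the program."""
    pc = start_pc
    end_pc = start_pc + 4 * len(words)
    while start_pc <= pc < end_pc:
        word = words[(pc - start_pc) // 4]  # fetch the word the PC points at
        pc += 4                             # default: next sequential instruction
        opcode = word >> 26                 # decode the instruction fields
        rs = (word >> 21) & 0x1F
        rt = (word >> 16) & 0x1F
        rd = (word >> 11) & 0x1F
        shamt = (word >> 6) & 0x1F
        funct = word & 0x3F
        imm = word & 0xFFFF
        if imm & 0x8000:                    # sign-extend the 16-bit immediate
            imm -= 0x10000
        if opcode == 0 and funct == 0x00:   # sll
            registers[rd] = (registers[rt] << shamt) & 0xFFFFFFFF
        elif opcode == 0 and funct == 0x20:  # add (overflow trap ignored here)
            registers[rd] = (registers[rs] + registers[rt]) & 0xFFFFFFFF
        elif opcode == 0 and funct == 0x08:  # jr: the register supplies the next PC
            pc = registers[rs]
        elif opcode == 0x23:                # lw
            registers[rt] = data_memory[registers[rs] + imm]
        elif opcode == 0x2B:                # sw
            data_memory[registers[rs] + imm] = registers[rt]
        else:
            raise ValueError(f"unsupported instruction word {word:032b}")
    return pc


words = [int(text.replace(" ", ""), 2) for text in SWAP_OBJECT_CODE]
registers = [0] * 32
registers[4] = 0x1000      # $4: address of v[0] (hypothetical value)
registers[5] = 1           # $5: k
registers[31] = 0x400000   # $31: return address (hypothetical, outside the program)
data_memory = {0x1000: 10, 0x1004: 20, 0x1008: 30}  # v = {10, 20, 30}, keyed by byte address

final_pc = run(words, registers, data_memory)
print(data_memory, hex(final_pc))
```

With k = 1, the run exchanges v[1] and v[2] and ends when `jr $31` loads the return address into the PC.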

Processor Organization

Control needs to have circuitry to

Decide which is the next instruction and input it from memory
Decode the instruction
Issue signals that control the way information flows between datapath components
Control what operations the datapath’s functional units perform

Datapath needs to have circuitry to

Execute instructions - functional units (e.g., adder) and storage locations (e.g., register file)
Interconnect the functional units so that the instructions can be executed as required
Load data from and store data to memory

What location does it load from and store to?

Output Data Stored in Memory

[Figure: block diagram with Processor (Control, Datapath), Memory, Devices (Input, Output) and Network; Memory holds the data below]

000001000101000000000000000000000000000001001111000000000000010000000011111000000000000000001000

At program completion the data to be output resides in memory

Output Device Outputs Data

[Figure: block diagram with Processor (Control, Datapath), Memory, Devices (Input, Output) and Network; the data below leaves through the output device]

000001000101000000000000000000000000000001001111000000000000010000000011111000000000000000001000

The Instruction Set Architecture (ISA)

[Figure: software above, hardware below, with the instruction set architecture as the boundary between them]

The interface description separating the software and hardware

The MIPS ISA

Instruction Categories

Load/Store
Computational
Jump and Branch
Floating Point
- coprocessor
Memory Management
Special

Registers: R0 - R31, PC, HI, LO

3 Instruction Formats: all 32 bits wide

OP | rs | rt | rd | sa | funct
OP | rs | rt | immediate
OP | jump target

Q: How many already familiar with MIPS ISA?

How Do the Pieces Fit Together?

[Figure: levels of abstraction, with labels Applications, Operating System, Compiler, Firmware, Instruction Set Architecture, Processor (Datapath & Control), Memory system, I/O system, network, Digital Design, Circuit Design]

Coordination of many levels of abstraction

Under a rapidly changing set of forces

Design, measurement, and evaluation

Performance Metrics

Purchasing perspective
given a collection of machines, which has the
- best performance?
- least cost?
- best cost/performance?

Design perspective
faced with design options, which has the
- best performance improvement?
- least cost?
- best cost/performance?

Both require
basis for comparison
metric for evaluation

Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors

Which of these airplanes has the best performance?

Airplane            Passengers   Range (mi)   Speed (mph)
Boeing 737-100      101          630          598
Boeing 747          470          4150         610
BAC/Sud Concorde    132          4000         1350
Douglas DC-8-50     146          8720         544

How much faster is the Concorde compared to the 747? How much bigger is the 747 than the Douglas DC-8?
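The two questions on this slide reduce to simple ratios over the table. The sketch below computes them, and also ranks the planes by passengers times speed as one possible throughput measure; that last metric is an assumption added for illustration and is not defined on the slide.

```python
# (passengers, range in miles, speed in mph), copied from the table.
PLANES = {
    "Boeing 737-100": (101, 630, 598),
    "Boeing 747": (470, 4150, 610),
    "BAC/Sud Concorde": (132, 4000, 1350),
    "Douglas DC-8-50": (146, 8720, 544),
}

# How much faster is the Concorde than the 747?
speed_ratio = PLANES["BAC/Sud Concorde"][2] / PLANES["Boeing 747"][2]
# How much bigger is the 747 than the DC-8?
capacity_ratio = PLANES["Boeing 747"][0] / PLANES["Douglas DC-8-50"][0]

fastest = max(PLANES, key=lambda name: PLANES[name][2])
longest_range = max(PLANES, key=lambda name: PLANES[name][1])
# Illustrative throughput measure (an assumption): passengers x mph.
best_throughput = max(PLANES, key=lambda name: PLANES[name][0] * PLANES[name][2])

print(f"Concorde vs 747 speed: {speed_ratio:.2f}x")
print(f"747 vs DC-8 capacity:  {capacity_ratio:.2f}x")
print(fastest, "|", longest_range, "|", best_throughput)
```

Each criterion picks a different plane, which is the point of the slide: "best performance" depends on the metric chosen.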

Computer Performance: TIME, TIME, TIME

Response Time (latency)
- How long does it take for my job to run?
- How long does it take to execute a job?
- How long must I wait for the database query?

Throughput
- How many jobs can the machine run at once?
- What is the average execution rate?
- How much work is getting done?

If we upgrade a machine with a new processor what do we increase?

If we add a new machine to the lab what do we increase?

Execution Time

Elapsed Time
counts everything (disk and memory accesses, I/O, etc.)
a useful number, but often not good for comparison purposes

CPU time
doesn't count I/O or time spent running other programs
can be broken up into system time and user time

Our focus: user CPU time
time spent executing the lines of code that are "in" our program

Book's Definition of Performance

For some program running on machine X,

PerformanceX = 1 / Execution timeX

"X is n times faster than Y"

PerformanceX / PerformanceY = n

Problem:
machine A runs a program in 20 seconds
machine B runs the same program in 25 seconds
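Applying the definition to the problem on this slide gives the ratio directly. A minimal sketch, with illustrative function names:

```python
def performance(execution_time_s):
    """Performance is defined as the reciprocal of execution time."""
    return 1.0 / execution_time_s


def times_faster(time_x_s, time_y_s):
    """n such that X is n times faster than Y."""
    return performance(time_x_s) / performance(time_y_s)


# Machine A: 20 seconds, machine B: 25 seconds (from the slide).
n = times_faster(20.0, 25.0)
print(f"A is {n:.2f} times faster than B")
```

Because performance is the reciprocal of time, the ratio is simply 25/20, so A is 1.25 times faster than B.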

Defining (Speed) Performance

Normally interested in reducing response time or increasing throughput

Response time (aka execution time) – the time between the start and the completion of a task
- Important to individual users

Throughput – the total amount of work done in a given time
- Important to data center managers

Thus, to maximize performance, need to minimize execution time

performanceX = 1 / execution_timeX

If X is n times faster than Y, then

performanceX / performanceY = execution_timeY / execution_timeX = n

Decreasing response time almost always improves throughput

Performance Factors

Want to distinguish elapsed time and the time spent on our task

CPU execution time (CPU time) – time the CPU spends working on a task

Does not include time waiting for I/O or running other programs

CPU execution time for a program = # CPU clock cycles for a program x clock cycle time

or

CPU execution time for a program = # CPU clock cycles for a program / clock rate

Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program
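The two forms of the CPU time formula are equivalent because clock rate is the reciprocal of clock cycle time. The sketch below checks this with a made-up cycle count (10 billion cycles is an assumption, not a figure from the slides) on a 4 GHz, 250 ps clock.

```python
def cpu_time_from_cycle_time(clock_cycles, clock_cycle_time_s):
    """CPU time = clock cycles x clock cycle time."""
    return clock_cycles * clock_cycle_time_s


def cpu_time_from_clock_rate(clock_cycles, clock_rate_hz):
    """CPU time = clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


cycles = 10e9  # hypothetical program: 10 billion clock cycles
time_a = cpu_time_from_cycle_time(cycles, 250e-12)  # 250 ps cycle
time_b = cpu_time_from_clock_rate(cycles, 4e9)      # 4 GHz clock
print(time_a, time_b)

# Halving the cycle count halves the CPU time at the same clock rate.
print(cpu_time_from_clock_rate(cycles / 2, 4e9))
```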

Review: Machine Clock Rate

Clock rate (MHz, GHz) is inverse of clock cycle time (clock period)

CC = 1 / CR

[Figure: clock waveform with one clock period marked]

10 nsec clock cycle => 100 MHz clock rate

1 nsec clock cycle => 1 GHz clock rate

500 psec clock cycle => 2 GHz clock rate

250 psec clock cycle => 4 GHz clock rate

200 psec clock cycle => 5 GHz clock rate

Clock Cycles

Instead of reporting execution time in seconds, we often use cycles

seconds/program = cycles/program × seconds/cycle

Clock “ticks” indicate when to start activities (one abstraction):

[Figure: clock ticks along a time axis]

cycle time = time between ticks = seconds per cycle

clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)

A 4 GHz clock has a cycle time of

1 / (4 × 10^9) seconds × 10^12 picoseconds/second = 250 picoseconds (ps)

How to Improve Performance

seconds/program = cycles/program × seconds/cycle

So, to improve performance (everything else being equal) you can either (increase or decrease?)

________ the # of required cycles for a program, or

________ the clock cycle time or, said another way,

________ the clock rate.

How many cycles are required for a program?

Could assume that number of cycles equals number of instructions

[Figure: time axis divided into equal slots for the 1st instruction, 2nd instruction, 3rd instruction, 4th, 5th, 6th, ...]

This assumption is incorrect: different instructions take different amounts of time on different machines.

Why? Hint: remember that these are machine instructions, not lines of C code

Different numbers of cycles for different instructions

[Figure: time axis with instructions occupying different numbers of cycles]

Multiplication takes more time than addition

Floating point operations take longer than integer ones

Accessing memory takes more time than accessing registers

Important point: changing the cycle time often changes the number of cycles required for various instructions

CSE431 L01 Introduction.71 Irwin, PSU, 2005

Clock Cycles per Instruction

Not all instructions take the same amount of time to execute

One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction

# CPU clock cycles for a program = # Instructions for a program x Average clock cycles per instruction

Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute

A way to compare two different implementations of the same ISA

       CPI for this instruction class
       A    B    C
CPI    1    2    3

Effective CPI

Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging

Overall effective CPI = Σ (CPIi x ICi), summed over i = 1 to n

Where
ICi is the count (percentage) of the number of instructions of class i executed
CPIi is the (average) number of clock cycles per instruction for that instruction class
n is the number of instruction classes

The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs

THE Performance Equation

Our basic performance equation is then

CPU time = Instruction_count x CPI x clock_cycle

or

CPU time = (Instruction_count x CPI) / clock_rate

These equations separate the three key factors that affect performance

Can measure the CPU execution time by running the program
The clock rate is usually given
Can measure overall instruction count by using profilers/simulators without knowing all of the implementation details
CPI varies by instruction type and ISA implementation, for which we must know the implementation details
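As a sketch of how the equation is used, the code below compares two hypothetical implementations of the same ISA running the same program, so the instruction count is identical and only CPI and clock rate differ. All numbers are made up for illustration and do not come from the slides.

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = (instruction count x CPI) / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical machines: same ISA and program, so the same instruction count.
instruction_count = 1_000_000
time_a = cpu_time_seconds(instruction_count, cpi=2.0, clock_rate_hz=4e9)  # 250 ps cycle
time_b = cpu_time_seconds(instruction_count, cpi=1.2, clock_rate_hz=2e9)  # 500 ps cycle

print(f"A: {time_a * 1e6:.0f} us, B: {time_b * 1e6:.0f} us")
print(f"A is {time_b / time_a:.2f} times faster than B")
```

Machine B has the lower CPI, but machine A still finishes first because its shorter clock cycle more than compensates.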


Determinants of CPU Performance

CPU time = Instruction_count x CPI x clock_cycle

                          Instruction_count   CPI   clock_cycle
Algorithm                 X                   X
Programming language      X                   X
Compiler                  X                   X
ISA                       X                   X     X
Processor organization                        X     X
Technology                                          X


A Simple Example

Op       Freq   CPIi   Freq x CPIi
ALU      50%    1      .5
Load     20%    5      1.0
Store    10%    3      .3
Branch   20%    2      .4
                       Σ = 2.2

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?

Freq x CPIi: .5, .4, .3, .4, so Σ = 1.6
CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster

How does this compare with using branch prediction to save a cycle off the branch time?

Freq x CPIi: .5, 1.0, .3, .2, so Σ = 2.0
CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster

What if two ALU instructions could be executed at once?

Freq x CPIi: .25, 1.0, .3, .4, so Σ = 1.95
CPU time new = 1.95 x IC x CC so 2.2/1.95 means 12.8% faster
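The three answers can be reproduced by recomputing the effective CPI with one class changed at a time. This is a minimal sketch using the frequencies and CPI values from the table; the function names are illustrative, and instruction count and clock cycle are assumed unchanged, as in the slide.

```python
# Instruction mix from the table: class -> (frequency, CPI).
BASE_MIX = {
    "ALU": (0.5, 1),
    "Load": (0.2, 5),
    "Store": (0.1, 3),
    "Branch": (0.2, 2),
}


def effective_cpi(mix):
    """Sum of frequency x CPI over all instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())


def speedup_with(changed_class, new_cpi):
    """Old effective CPI divided by new effective CPI after changing one class."""
    new_mix = dict(BASE_MIX)
    freq, _old_cpi = new_mix[changed_class]
    new_mix[changed_class] = (freq, new_cpi)
    return effective_cpi(BASE_MIX) / effective_cpi(new_mix)


print(f"Base effective CPI: {effective_cpi(BASE_MIX):.2f}")
print(f"Load CPI 5 -> 2:    {speedup_with('Load', 2):.3f}x")
print(f"Branch CPI 2 -> 1:  {speedup_with('Branch', 1):.3f}x")
print(f"ALU CPI 1 -> 0.5:   {speedup_with('ALU', 0.5):.3f}x")
```

Executing two ALU instructions at once is modeled here as halving the ALU CPI, which is how the slide arrives at .25 for that row.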

Comparing and Summarizing Performance

Guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))

How do we summarize the performance for a benchmark set with a single number?

The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)

AM = (1/n) Σ Timei, summed over i = 1 to n

Where Timei is the execution time for the ith program of a total of n programs in the workload

A smaller mean indicates a smaller average execution time and thus improved performance
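A short sketch of the arithmetic mean over a workload, using made-up execution times (the three values are assumptions, not measurements). It also shows why the AM tracks total execution time: multiplying it by n gives back the total.

```python
def arithmetic_mean(times_s):
    """AM = (1/n) x sum of the n execution times."""
    return sum(times_s) / len(times_s)


# Hypothetical execution times (seconds) for a three-program workload.
times_s = [2.0, 10.0, 0.5]
am = arithmetic_mean(times_s)

print(f"AM = {am:.3f} s, total = {sum(times_s):.1f} s")
print(f"AM x n = {am * len(times_s):.1f} s")
```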

Remember

Performance is specific to a particular program or programs
Total execution time is a consistent summary of performance

For a given architecture, performance increases come from:
increases in clock rate (without adverse CPI effects)
improvements in processor organization that lower CPI
compiler enhancements that lower CPI and/or instruction count
algorithm/language choices that affect instruction count

Pitfall: expecting improvement in one aspect of a machine’s performance to affect the total performance

Summary: Evaluating ISAs

Design-time metrics:
Can it be implemented, in how long, at what cost?
Can it be programmed? Ease of compilation?

Static Metrics:
How many bytes does the program occupy in memory?

Dynamic Metrics:
How many instructions are executed?
How many bytes does the processor fetch to execute the program?
How many clocks are required per instruction?
How "lean" a clock is practical?

Best Metric: Time to execute the program!

[Figure: triangle relating Inst. Count, CPI and Cycle Time]

This depends on the instruction set, the processor organization, and compilation techniques.

Next Lecture and Reminders

Next lecture

Instructions: Language of the Computer
- Reading assignment – Chapter 2
