© Wen-mei Hwu and S. J. Patel, 2005 ECE 511, University of Illinois Lecture 2: Basic Notions and Fundamentals.

© Wen-mei Hwu and S. J. Patel, 2005ECE 511, University of Illinois

Lecture 2: Basic Notions and

Fundamentals


Outline

• Basic computer organization– von Neumann model and execution cycle– Pipelines and caches

• Architectural drivers– Technology– Applications and compatibility– Compilers

• Measures– Methodology– Key measures


von Neumann’s Contribution

• Fetch• Decode• Evaluate

addresses• Fetch operands• Execute• Store results

…

Control

Datapath

Memory

program

Input/Output

Instruction Cycle


Pipelining

• Instruction latency does not decrease• Throughput increases• Dependencies degrade performance• General trends: deeper/wider until

2004

Fetch Execute MemDecode Write


Fetch

adder

ALU

ALU

BR

MemDeco

de/S

wap

Stalllogic

write

write

write

Multiple Issue


Pipeline of the Pentium IV[Courtesy of Intel Corp]

• Enables very high clock rate• Enables scalability from one

technology to the next• But does it always lead to highest

performance?

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

TC nxt IP TC Fetch Drv Alloc Rename Que Sch Sch Sch Disp Disp RF RF Ex FlgsBrCk Drv


Speeding up memory

• Fundamental: Principles of locality of memory references– Reference to X another reference to X later– Reference to X reference to Y, where X, Y are

close

• Fundamental: Memory tends to be small/fast/expensive or large/slow/cheap

• Result: Memory hierarchies with caching


Basic Cache Organization

…

tag index offset

Data

Address

hit/miss

cache linetag valid

Direct Mapped Cache


CMOS Inverter

Vdd

Vss

Input Output

Schematic Notation

p+ p+ n+ n+

Cross-section of inverter in n-well process

n well p substrate

metalfield oxide

polysilicongate oxide

p-MOS trans n-MOS trans

VssVdd

gate

p-MOS trans

n-MOS trans

gate

gate


CMOS Trends[Collated from the 2000 update of the International Technology Roadmap for

Semiconductors]

Year 1995 1998 2001 2004 2007 2010 2013

Feature Size

(nanometers)

350nm

180nm

130nm

90nm 65nm 45nm 33nm

TransistorCount

10M 50M 110M 350M 1300M 3500M

11000M

n+ n+

For high-performance CPUs, the feature sizeis (typically) the minimum width of a gate


• Delay is proportional to the number of gates– typical measure FO4

• Power dissipation– dynamic (switching)– Static (leakage)

An Microarchitect’s Model of CMOS Power and Delay

Vdd

Vss

Input Output

p-MOS trans

n-MOS trans


CMOS Trends[Collated/extrapolated from the 2000 update of

the International Technology Roadmap for

Semiconductors]Year 1995 1998 2001 2004 2007 2010 2013

Feature Size

(nanometers)

350nm

180nm

130nm

90nm 65nm 45nm 33nm

TransistorCount

10M 50M 110M 350M 1300M 3500M

11000M

Projected FO4

delaysIn each

pipe satge

27 13 11 8 7 6 6

This will likely be revised due to the transistor variability problem from 65nm generation on.


• Transistors become more plentiful but variable

• Gates become faster• Wires become relatively slower

• Memory becomes relatively slower• Power becomes a critical issue• Noise, error rates, design complexity

CMOS Trends


Trends in hardware• High variability

– Increasing speed and power variability of transistors

– Limited frequency increase– Reliability / verification

challenges

• Large interconnect delay– Increasing interconnect

delay and shrinking clock domains

– Limited size of individual computing engines

Interconnect RC DelayInterconnect RC Delay

1

10

100

1000

10000

350 250 180 130 90 65

Del

ay (

ps)

Clock Period

RC delay of 1mm interconnect

Copper Interconnect

130nm

30%

5X0.90.9

1.01.0

1.11.1

1.21.2

1.31.3

1.41.4

11 22 33 44 55Normalized Leakage (INormalized Leakage (Isbsb))

No

rmal

ized

Fre

qu

ency

No

rmal

ized

Fre

qu

ency

Source: Shekhar Borkar, Intel


Dynamic-Static Interface

• Moving functionality into compilers– Most architectures today, including X86 rely on

compilers to achieve performance goals– A major issue is the number of bits required to

deliver information from the compiler to the runtime hardware

– Highly optimizing compiler also reduced the incentive for machine language programming, which makes portable programs a reality.

– The more implementation details one exposes in the instruction set architecture, the more difficult it is to adopt new implementation techniques.


Applications

• Applications drive architecture from ‘above”

• Designing next-generation computers involves understanding the behavior of applications of importance and exploiting their characteristics

• What are applications of importance?

• For us, we will often use benchmarks to characterize different architectural options


In the image of the simple hardware,

humans created complex software• Software creation based on a

simple execution model– Instructions are considered to

execute sequentially– Data objects are mapped into a

flat, monolithic store reachable by all

– Reality when laid out by von Neumann in the 40’s; abstraction now

• This execution abstraction has been used in development of large, complex software– “Traditional software model”


Future apps reflect a concurrent world

• Exciting applications in future mass computing market have been traditionally considered “supercomputing applications”– Physiological simulation – cellular pathways (GE Research)– Molecular dynamics simulation (NAMD at UIUC)– Video and audio coding and manipulation – MPEG-4

(NCTU)– Medical imaging – CT (UIUC)– Consumer game and virtual reality products

• These “Super-applications” represent and model physical world

• Various granularities of parallelism exist, but…– programming model must support required dimensions– data delivery needs careful management


Direction of computer architecture

• Current general purpose architectures cover traditional applications• New parallel-privatized architectures cover some super-applications• Attempts to grow current architectures “out” or domain-specific

architectures “in” lack success• By properly exploiting parallelism of super-applications, the

coverage of domain-specific architectures can be extended

Traditional applications

Current architecture coverage

New applications

Domain-specificarchitecture coverage

Obstacles


Compatibility

• The case of workstations and servers– Relatively open to new architectures.– Performance is a major concern.– Linux/UNIX is a portable operating system.– Current economics model works against new

architectures

• The case of personal computers– Very tough on new architectures.– Windows and Apple OS are not portable– Most applications are distributed in binary

code– Price is of more concern than performance.


Experimental Methodology

• Selected real programs by characterizing workload– PERFECT Club, SPEC, MediaBench– What do these programs do?– What input was given to these programs?– How are they related to your own workload?– What do experimental results mean?

• Require high quality software support.– Tremendous variation in capability

• Trace driven simulation vs. re-compilation

• Nothing can replace real-machine measurements


Measures

Iron Law: performance = 1 / execution time= 1/ (CPI * insts * 1/freq)(Basis of SPEC Marks)or (IPC * freq)/insts

CPI : Cycles per instruction (how is this calculated)?

IPC : Instructions per cycle

other useful measures : average memory access time

© Wen-mei Hwu and S. J. Patel, 2005 ECE 511, University of Illinois Lecture 2: Basic Notions and Fundamentals.

Documents

wenmei hwu

university of illinois

university of illinois

caching slide

multiple issue slide

process n

x reference

memory fundamental