Overview
Z. Jerry Shi
Assistant Professor of Computer Science and Engineering
University of Connecticut
* Slides adapted from Blumrich&Gschwind/ELE475’03, Peh/ELE475’*
What is Computer Architecture?
• The manner in which the components of a computer or computer system are organized and integrated – Merriam-Webster Dictionary
• The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the dataflow and controls, the logic design, and the physical implementation. – Gene Amdahl, IBM Journal of R&D, April 1964
• In this book the word architecture is intended to cover all three aspects of computer design – instruction set architecture, organization, and hardware. – H&P text
What is Computer Architecture?
• High-level programming language
  x++;
• Assembly language
  LD    F2, #1     ; Load number 1 into register F2
  LD    F0, 0(R1)  ; Load value x into register F0
  ADD.D F4, F0, F2 ; Add F0 and F2, place in F4
  S.D   F4, 0(R1)  ; Store resulting sum into x
• Computer Architecture: How to make this work well, given an Adder, SRAMs, DRAMs, clock and pipeline circuits, etc.
  – Instruction fetch
  – Operand fetch
  – Instruction execution
  – Writeback
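The payoff of overlapping these stages can be sketched with a toy cycle-count model. This is a simplification for intuition only: it assumes one cycle per stage and no hazards or stalls, neither of which holds in a real pipeline.

```python
# Toy model of the four stages above (instruction fetch, operand fetch,
# execution, writeback), assuming one cycle per stage, no hazards.
N_STAGES = 4

def unpipelined_cycles(n_instructions, n_stages=N_STAGES):
    # Each instruction occupies the whole datapath for n_stages cycles.
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages=N_STAGES):
    # The first instruction takes n_stages cycles; each later one
    # completes one cycle behind its predecessor.
    return n_stages + (n_instructions - 1)

print(unpipelined_cycles(4))  # 16
print(pipelined_cycles(4))    # 7
```

With the four instructions of the `x++` example, pipelining cuts the cycle count from 16 to 7 under these idealized assumptions.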
Architecture vs. Design
• Computer architecture applies at many levels:
  – System architecture
  – Processor microarchitecture
  – Memory system architecture
  – Disk architecture
  – Cache architecture
  – Network architecture
• Architecture is the plan; design is the implementation
• Good architects understand design; good designers understand architecture.
You have to know both!
A Useful Analogy
• Building Architecture → Computer Architecture
  – Sand, clay, wood, etc. → transistors, logic gates
  – Bricks, timber, … → ALUs, flip-flops, bit cells, crossbars, …
  – Compose them to form buildings → compose them to form systems
Computers: Desktops
• Single-user systems like PCs and laptops
• Large market: drives microprocessor innovation
• Sensitive to price-performance and latency
• Key consumer metric: clock rate
Computers: Servers
• Large-scale, shared systems
  – Enterprise computing, web servers
• Made up of microprocessors too
• Very sensitive to availability, scalability, and throughput
• Key consumer metric: benchmarks
Computers: Embedded Systems
• Computers in devices where the presence of computers is not immediately obvious
  – Appliances (microwaves), handheld devices (palmtops, cell phones), video game consoles, digital set-tops, Internet routers, cars, …
• Wide range of processing power and cost
• Sensitive to price, power, and real-time performance
What you will learn in this course
• Instruction set architecture
• Pipelining
  – Branches, hazards, forwarding, etc.
  – Superscalar
• Memory
  – Basic caches (how caches are built)
  – Direct-mapped, set-associative caches
  – Basic virtual memory
  – Disk operation
• Multiprocessor systems– Interconnection networks
• Improve architecture for specific applications/domains
• Learn on your own
  – Read and think
Parallel system architecture
• Processor architecture
  – Multiple instructions issued simultaneously
• Memory architecture
  – Multiple cache banks
  – Multiple memory accesses
• Network architecture
  – Multiple processors and memories communicating
• Multiprocessor architecture
  – All of the above
• Why? Because today’s and tomorrow’s systems are parallel systems.
What you will learn
System Case Studies
• IBM PowerPC
• Intel P6
• Intel Itanium
• Transmeta Crusoe
• Intel XScale
• Sony Playstation 2
• Microsoft Xbox
• Rambus DRDRAM
• Google Blades
• InfiniBand Clusters
• Alpha 21364
• IBM Blue Gene
• Intel Pentium 4
To appreciate the architectures of these systems
Major topics
• Topic 1: Processor: Superscalars [IBM PowerPC, Intel P6, Alpha 21264]
  – Out-of-order multi-issue pipelines
  – Branches
• Topic 2: Memory [RAMBUS, Alpha 21364, Sony PlayStation, Microsoft Xbox]
  – Advanced caches
  – Virtual memory
  – Memory technology
• Topic 3: Multiprocessors [Intel Pentium 4, IBM Blue Gene, IBM Power 5]
  – Shared-memory vs. message passing
  – Multi-threading, simultaneous multi-threading, cache coherence
• Topic 4: Networks [Google Blades, IBM Blue Gene, MIT Raw chip multiprocessor]
  – Network interfaces to cache coherence
  – Overview of network architecture
• Topic 5: Compiler & Processor: VLIW [Intel Itanium]
  – Static branch prediction
  – Instruction scheduling
Basic Concepts
• Instruction set architecture (next lecture)
• CPI: clock cycles per instruction
  – CPI = clock cycles for a program / instruction count
  – Instructions per cycle (IPC) may also be used
• Speedup
• Compare performance of different computers?
  – Benchmarks
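The CPI and IPC definitions above can be written out directly. The counts below are made-up numbers for illustration, not measurements from any real machine.

```python
def cpi(clock_cycles, instruction_count):
    # Clock cycles per instruction, averaged over the whole program.
    return clock_cycles / instruction_count

def ipc(clock_cycles, instruction_count):
    # Instructions per cycle: the reciprocal of CPI.
    return instruction_count / clock_cycles

# Hypothetical program: 1 million instructions in 2 million cycles.
print(cpi(2_000_000, 1_000_000))  # 2.0
print(ipc(2_000_000, 1_000_000))  # 0.5
```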
Execution time
The execution time of a single application can be estimated as:
Path Length × CPI × Cycle Time
Path length: number of instructions neededCPI: average number of cycles per instructionCycle time: length of each cycle
If you have multiple applications, be careful when using a single number to characterize the processor’s performance. The single number may be the arithmetic mean, geometric mean, or harmonic mean of the performance of the applications.
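The three-factor product above translates directly into code. The numbers here are a hypothetical example: 10^9 instructions at an average CPI of 2.0 on a 1 GHz clock (1 ns cycle time).

```python
def execution_time(path_length, cpi, cycle_time):
    # Execution time = Path Length x CPI x Cycle Time.
    # path_length: dynamic instruction count
    # cpi: average clock cycles per instruction
    # cycle_time: seconds per clock cycle
    return path_length * cpi * cycle_time

# 10^9 instructions * 2.0 CPI * 1 ns/cycle = 2 seconds.
print(execution_time(1_000_000_000, 2.0, 1e-9))  # 2.0
```

The same formula shows the three levers a designer can pull: fewer instructions (ISA/compiler), fewer cycles per instruction (microarchitecture), or a shorter cycle (circuits/technology).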
Amdahl’s law
• The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used
• Speedup is defined as:

  Speedup = ExecutionTime_before_enhancement / ExecutionTime_after_enhancement

Example: Using a new method, 40% of an application can execute 10 times faster. The overall speedup can be calculated as:

  Speedup = 1 / (0.6 + 0.4/10) = 1 / 0.64 ≈ 1.56

The best speedup you can achieve by optimizing the 40% of the code is 1/0.6 ≈ 1.67
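The worked example above can be checked with a small function. Passing an infinite enhancement speedup gives the upper bound from optimizing only that fraction.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Overall speedup when only `fraction_enhanced` of the original
    # execution time runs `speedup_enhanced` times faster.
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# 40% of the program runs 10x faster: overall speedup ~1.56.
print(round(amdahl_speedup(0.4, 10), 2))            # 1.56
# Even with an infinitely fast mode, the bound is 1/0.6 ~ 1.67.
print(round(amdahl_speedup(0.4, float('inf')), 2))  # 1.67
```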
Performance: What to measure
• Usually rely on benchmarks vs. real workloads
• To increase predictability, collections of benchmark applications -- benchmark suites -- are popular
• SPEC CPU: popular desktop benchmark suite
  – CPU only, split between integer and floating-point programs
  – SPEC CINT2000 has 12 integer programs, SPEC CFP2000 has 14 floating-point programs
  – SPEC CPU2006 (August 2006): 12 int apps. and 17 fp apps
  – SPECSFS (NFS file server) and SPECWeb (web server) added as server benchmarks
• Transaction Processing Council measures server performance and cost-performance for databases
  – TPC-C: complex queries for Online Transaction Processing
  – TPC-H: models ad hoc decision support
  – TPC-W: a transactional web benchmark
  – TPC-App: application server and web services benchmark
Measures
• Rate?
  – Millions of floating-point operations per second (mflops)
  – Tends to be misleading
• Time?
  – Time is the reciprocal of rate

Assume a total of n benchmarks. M_i is the performance of benchmark i measured in mflops:

  M_i = F_i / T_i

where F_i is the number of floating-point operations and T_i the execution time.
Arithmetic mean
• Unweighted arithmetic mean

  A_mean = (1/n) × Σ_{i=1..n} M_i

• Weighted arithmetic mean (weights w_i sum to 1)

  A_mean = Σ_{i=1..n} w_i × M_i
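Both definitions are one-liners in code. The mflops scores and weights below are made-up values for illustration.

```python
def arithmetic_mean(rates):
    # Unweighted: sum of the measurements divided by their count.
    return sum(rates) / len(rates)

def weighted_arithmetic_mean(rates, weights):
    # Weighted: the weights w_i are assumed to sum to 1.
    return sum(w * m for w, m in zip(weights, rates))

# Hypothetical mflops scores for three benchmarks.
mflops = [100.0, 200.0, 400.0]
print(arithmetic_mean(mflops))                              # ~233.33
print(weighted_arithmetic_mean(mflops, [0.5, 0.25, 0.25]))  # 200.0
```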
Geometric Mean
• Unweighted geometric mean

  G_mean = (Π_{i=1..n} M_i)^(1/n)

• Weighted geometric mean

  G_mean = Π_{i=1..n} M_i^(w_i)

• Consistently maintains the performance relationships regardless of the basis (the reference machine used for normalization).
• Does not predict execution time; thus, it may lead to wrong conclusions.
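A small numeric sketch illustrates both bullets above. The benchmark times for machines A and B are invented for the example.

```python
import math

def geometric_mean(values):
    return math.prod(values) ** (1.0 / len(values))

# Hypothetical times (seconds) for two benchmarks on machines A and B.
a = [2.0, 8.0]
b = [4.0, 4.0]

# Basis independence: normalizing B's times to A gives the same answer
# as normalizing A's to B (the two results are reciprocals).
print(geometric_mean([bi / ai for ai, bi in zip(a, b)]))  # 1.0
print(geometric_mean([ai / bi for ai, bi in zip(a, b)]))  # 1.0

# But it does not predict execution time: both geometric means are 4.0,
# yet B finishes the whole suite sooner (8 s vs. 10 s).
print(geometric_mean(a), geometric_mean(b), sum(a), sum(b))
```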
Harmonic Mean

• Unweighted harmonic mean

  H_mean = n / Σ_{i=1..n} (1/M_i)

• Weighted harmonic mean

  H_mean = 1 / Σ_{i=1..n} (w_i/M_i)

• Should be used for summarizing performance expressed as a rate.
• Equivalent to calculating the total number of operations divided by the total time. If each benchmark performs the same number of operations F (so M_i = F/T_i):

  H_mean = n / Σ_{i=1..n} (1/M_i) = n / Σ_{i=1..n} (T_i/F) = nF / Σ_{i=1..n} T_i
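The "total operations / total time" equivalence is easy to verify numerically. The rates below are invented, with each benchmark assumed to perform the same F = 100 Mflop of work.

```python
def harmonic_mean(rates):
    # n divided by the sum of reciprocals.
    return len(rates) / sum(1.0 / m for m in rates)

# Hypothetical suite: each benchmark performs F = 100 Mflop, so its
# rate is M_i = F / T_i (times of 2.0 s and 0.5 s here).
F = 100.0
rates = [50.0, 200.0]            # mflops
times = [F / m for m in rates]

print(harmonic_mean(rates))         # 80.0
print(len(rates) * F / sum(times))  # 80.0: total operations / total time
```

The arithmetic mean of the same rates would be 125 mflops, which overstates the suite's true aggregate rate of 200 Mflop in 2.5 s.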
Summary of comparing multiple applications
• Use the arithmetic mean for a measure of performance expressed as time
• Use the harmonic mean for a measure of performance expressed as a rate
• Do NOT use the geometric mean for either of them
• An aggregate performance measure should be calculated before any normalizing is done
J. E. Smith, “Characterizing computer performance with a single number,” Communications of the ACM, vol. 31, no. 10, pp. 1202–1206, October 1988.