CSE 502 Graduate Computer
Architecture
Lec 1-3 - Introduction
Larry WittieComputer Science, StonyBrook University
http://www.cs.sunysb.edu/~cse502 and ~lw
Slides adapted from David Patterson, UC-Berkeley cs252-s06
1/25,27 + 2/1/2010 CSE502-S10, Lec 01-3 - intro
Outline
• Computer Science at a Crossroads
• Computer Architecture v. Instruction Set Arch.
• How would you like your CSE502?
• What Computer Architecture brings to the table
  – Quantitative Principles of Design
  – Technology Performance Trends
  – Careful, Quantitative Comparisons
• Old Conventional Wisdom: Power is free, transistors are expensive
• New Conventional Wisdom: "Power wall": power is expensive, transistors are free
  (can put more transistors on a chip than one can afford to turn on)
• Old CW: Can keep increasing Instruction Level Parallelism via compilers and innovation (out-of-order, speculation, VLIW, …)
• New CW: "ILP wall": law of diminishing returns on more HW for ILP
• Old CW: Multiplies are slow, memory access is fast
• New CW: "Memory wall": memory is slow, multiplies are fast
  (200 clock cycles to DRAM memory, 4 clocks for a multiply)
• Old CW: Uniprocessor performance 2X / 1.5 yrs
• New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
  – Uniprocessor performance now 2X / 5(?) yrs
Sea change in chip design: multiple "cores" (2X processors per chip / ~2 years)
  » Increase the on-chip number of simple processors that are power efficient
  » Simple processor "cores" use less power per useful calculation done
• Multiprocessors were "imminent" in the 1970s, '80s, '90s, …
• "… today's processors … are nearing an impasse as technologies approach the speed of light …"
  David Mitchell, The Transputer: The Time Is Now (1989)
• The Transputer was premature: custom multiprocessors strove to lead uniprocessors, and procrastination was rewarded: 2X sequential performance / 1.5 years
• "We are dedicating all of our future product development to multicore designs. … This is a sea change in computing."
  Paul Otellini, President, Intel (2004)
• The difference now is that all microprocessor companies have switched to multiprocessors (AMD, Intel, IBM, Sun; all new Apples have 2 CPUs)
  – Procrastination penalized: 2X sequential perf. / 5 yrs
  – Biggest programming challenge: going from 1 to 2 CPUs
Problems with Sea Change
• Algorithms, programming languages, compilers, operating systems, architectures, libraries, … are not ready to supply Thread Level Parallelism or Data Level Parallelism for 1000 CPUs / chip
• Architectures are not ready for 1000 CPUs / chip
• Unlike Instruction Level Parallelism, this cannot be solved by computer architects and compiler writers alone, but it also cannot be solved without the participation of computer architects
• This edition of CSE 502 (and 4th Edition of textbook Computer Architecture: A Quantitative Approach) explores shift from Instruction Level Parallelism to Thread Level Parallelism / Data Level Parallelism
Outline
• Computer Science at a Crossroads
• Computer Architecture v. Instruction Set Arch.
• How would you like your CSE502?
• What Computer Architecture brings to the table
  – Quantitative Principles of Design
  – Technology Performance Trends
  – Careful, Quantitative Comparisons
Instruction Set Architecture: Critical Interface
The computing system as seen by programmers
[Diagram: the instruction set is the interface between software (above) and hardware (below)]
• Properties of a good abstraction
  – Lasts through many generations (portability)
  – Used in many different ways (generality)
  – Provides convenient functionality to higher levels
  – Permits an efficient implementation at lower levels
  – What matters today is the performance of complete computer systems
• Grad students with too varied a background?
  – You will have a difficult time if you have not had an undergrad course using a Hennessy & Patterson text.
• Grads without a CSE320 equivalent may have to work hard; review the CSE502 text's Appendices A, B, and C; the CSE320 home page; and maybe the CSE320 text Computer Organization and Design (COD), 3/e
  – Read chapters 1 to 8 of COD if you never took the prerequisite
  – If you took such a class, be sure COD Chapters 2, 6, and 7 are very familiar
• We will spend two week-long lectures reviewing Pipelining (App. A) and Memory Hierarchy (App. C) before an in-class quiz to check that everyone is OK.
• 18% Homeworks (practice for the exams)
• 74% Exams: {4% Quiz, 20% Midterm, 50% Final Exam}
• 8% (Optional) Research Project (work in pairs)
  – You need to show initiative
  – Pick a topic (more on this later)
  – Give an oral presentation or poster session
  – Write a report in the style of a conference paper
  – 5 weeks of full-time work for 2 people
  – An opportunity to do "research in the small" to help make the transition from good undergrad student to research colleague
• I may add up to 3% to a student's final score, usually given only to people showing marked improvement during the course.
• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
  – Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  – Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• For 30 years, HW has relied on locality for memory performance.
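As an illustrative sketch (the matrix size and loop orders below are hypothetical, not from the slides), both kinds of locality can be seen in an ordinary summation loop:

```python
# Illustrative sketch of locality in an ordinary summation loop.
# (The matrix size and loop orders here are made-up examples.)

def sum_row_major(matrix):
    """Sum a 2D list row by row.

    Temporal locality: 'total' is reused on every iteration.
    Spatial locality: elements of each row are visited at adjacent
    addresses, so a cache line fetched for one element also serves
    its neighbors (in a language with contiguous row storage).
    """
    total = 0
    for row in matrix:           # straight-line sweep over rows
        for x in row:            # consecutive elements: spatial locality
            total += x           # accumulator reuse: temporal locality
    return total

def sum_column_major(matrix):
    """Same sum, column by column: each access jumps a whole row
    ahead, so spatial locality is poor for row-major storage."""
    total = 0
    rows, cols = len(matrix), len(matrix[0])
    for j in range(cols):
        for i in range(rows):
            total += matrix[i][j]
    return total

m = [[i * 4 + j for j in range(4)] for i in range(4)]  # values 0..15
print(sum_row_major(m), sum_column_major(m))  # both print 120
```

Both loops compute the same result; on large arrays in a compiled language, the row-major order is the one a cache rewards.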
• For disk, LAN, memory, and microprocessor, bandwidth improves by more than the square of the latency improvement
  – In the time that bandwidth doubles, latency improves by no more than 1.2X to 1.4X
• The lag of latency gains vs. bandwidth gains is probably even larger in real systems, as bandwidth gains are multiplied by replicated components
  – Multiple processors in a cluster or even on a chip
  – Multiple disks in a disk array
  – Multiple memory modules in a large memory
  – Simultaneous communication in switched local area networks (LANs)
• HW and SW developers should innovate assuming Latency Lags Bandwidth
  – If everything improves at the same rate, then nothing really changes
  – When rates vary, good designs require real innovation
Define and Quantify Power (1/2)
• For CMOS chips, the traditionally dominant energy use has been in switching transistors, called dynamic power:

  Power_dynamic = 1/2 × Capacitive Load × Voltage² × Frequency Switched

• For mobile devices, energy is a better metric:

  Energy_dynamic = Capacitive Load × Voltage²
• For a fixed task, slowing the clock rate (the switching frequency) reduces power, but not energy
• Capacitive load is a function of the number of transistors connected to an output and of the technology, which determines the capacitance of wires and transistors
• Dropping voltage helps both, so ICs went from 5V to 1V
• To save energy and dynamic power, most CPUs now turn off the clock of inactive modules (e.g., the floating-point arithmetic unit)
• If a 15% voltage reduction causes a 15% reduction in frequency, what is the impact on dynamic power?
  New power / old power = 0.85² × 0.85 = 0.85³ = 0.614, a "39% reduction"
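The arithmetic in this example can be checked directly; the sketch below just applies the dynamic-power formula to the slide's 15% figures:

```python
# Dynamic power: P = 1/2 * C_load * V^2 * f_switched.
# A 15% voltage drop scales V^2 by 0.85^2; if it also forces a 15%
# frequency drop, f scales by 0.85, so power scales by 0.85^3.

def dynamic_power_ratio(voltage_scale, frequency_scale):
    """Ratio of new dynamic power to old (C_load unchanged)."""
    return voltage_scale ** 2 * frequency_scale

ratio = dynamic_power_ratio(0.85, 0.85)
print(f"new/old power = {ratio:.3f}")  # 0.614, i.e. about a 39% reduction

# Energy for a fixed task depends only on V^2 (frequency drops out):
energy_ratio = 0.85 ** 2
print(f"new/old energy = {energy_ratio:.4f}")  # 0.7225
```

The second print illustrates the earlier bullet: slowing the clock alone would leave energy per task unchanged, while lowering voltage cuts both power and energy.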
• If modules have exponentially distributed lifetimes (the age of a module does not affect its probability of failure), the overall failure rate (FIT) is the sum of failure rates of the modules
• Calculate FIT (rate) and MTTF (1/rate) for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF), and 1 power supply (0.2M hour MTTF):
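Working the exercise through with the MTTFs given above (FIT counts failures per billion device-hours, and rates of exponentially distributed modules simply add):

```python
# Failure rates add for modules with exponential lifetimes.
# FIT = failures per 10^9 hours of operation; MTTF = 1 / failure rate.

BILLION = 1_000_000_000

def fit_from_mttf(mttf_hours):
    """Failure rate in FIT for one module with the given MTTF."""
    return BILLION / mttf_hours

# 10 disks (1M-hr MTTF each), 1 controller (0.5M hr), 1 power supply (0.2M hr)
system_fit = (10 * fit_from_mttf(1_000_000)
              + fit_from_mttf(500_000)
              + fit_from_mttf(200_000))

system_mttf = BILLION / system_fit

print(f"System failure rate = {system_fit:.0f} FIT")  # 17000 FIT
print(f"System MTTF = {system_mttf:.0f} hours")       # about 59,000 hours
```

Note the power supply, despite being one unit among twelve, contributes the largest single share of the failure rate because its MTTF is lowest.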
• Computer Science at a Crossroads
• Computer Architecture v. Instruction Set Arch.
• How would you like your CSE502?
• Technology Trends: Culture of tracking, anticipating and exploiting advances in technology
• Careful, quantitative comparisons:
  1. Define and quantify power
  2. Define and quantify dependability
  3. Define, quantify, and summarize relative performance
• To increase predictability, collections of benchmark applications, called benchmark suites, are popular
• SPEC CPU: popular desktop benchmark suite
  – CPU only, split between integer and floating-point programs
  – SPECint2000 had 12 integer codes; SPECfp2000 had 14 floating-point codes
  – SPEC CPU2006 has 12 integer benchmarks (CINT2006) and 17 floating-point benchmarks (CFP2006)
  – SPECSFS (NFS file server) and SPECWeb (web server) have been added as server benchmarks
• The Transaction Processing Council (TPC) measures server performance and cost-performance for databases
  – TPC-C: complex queries for Online Transaction Processing
  – TPC-H: models ad hoc decision support
  – TPC-W: a transactional web benchmark
  – TPC-App: an application server and web services benchmark
• The standard deviation of SPECRatios for the Itanium 2 is 1.98, much higher than the Athlon's 1.40, so its results differ more widely from the mean and are therefore likely less predictable
• SPECRatios falling within one standard deviation:
  – 10 of 14 benchmarks (71%) for Itanium 2
  – 11 of 14 benchmarks (78%) for Athlon
• Thus, the results are quite compatible with a lognormal distribution (expect 68% within 1 StDev)
• For the ratio of Itanium 2 to Athlon performance, the StDev is 1.74, which is high, so there is less confidence in the claim that the Itanium 2 is 1.30 times as fast as the Athlon
  – Indeed, the Athlon is faster on 6 of 14 programs
• The range of ratios is [0.75, 2.27], with 11/14 inside 1 StDev (78%)
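The statistics above assume SPECRatios are summarized by the geometric mean, with spread measured as a multiplicative (geometric) standard deviation. A sketch of that computation, using made-up placeholder ratios rather than real SPEC results:

```python
# SPECRatio summaries use the geometric mean; spread around it is the
# geometric standard deviation, exp(stdev(ln ratios)).
# The ratios below are hypothetical placeholders, not real SPEC data.
import math

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def geometric_stdev(xs):
    logs = [math.log(x) for x in xs]
    mean = sum(logs) / len(logs)
    var = sum((v - mean) ** 2 for v in logs) / len(logs)
    return math.exp(math.sqrt(var))

ratios = [1.1, 0.9, 1.5, 2.0, 0.8, 1.2]   # hypothetical SPECRatios
gm, gsd = geometric_mean(ratios), geometric_stdev(ratios)

# "Within one standard deviation" is multiplicative: [gm/gsd, gm*gsd]
inside = [r for r in ratios if gm / gsd <= r <= gm * gsd]
print(f"geo mean = {gm:.2f}, geo stdev = {gsd:.2f}, "
      f"{len(inside)}/{len(ratios)} within 1 StDev")
```

For a lognormal distribution, about 68% of the ratios should land in that multiplicative interval, which is exactly the check the slide applies to the Itanium 2 and Athlon results.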
Computer Architecture >> ISA: Comp. Arch. Is an Integrated Design Approach
• Old, pre-1980 definition of computer architecture: Computer Arch. = Instruction Set Architecture
  – Other aspects of computer design were called implementation
  – Insinuates that implementation is uninteresting or less challenging
• The architect's job is much more than instruction set design; the technical hurdles in computers today are more challenging than those in instruction set design
• Computer architecture is not just about transistors, individual instructions, or particular implementations
  – The original 1980s RISC projects replaced complex instructions with a complex compiler and simple instructions
• What really matters today is the performance of complete computer systems
  – Hardware, runtime system, operating system, compiler, applications
  – In networking, this is called the "End to End argument"
• In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4
  (capacity improves much faster than bandwidth; disk: 2500X vs. 143X)
• Stated alternatively: bandwidth improves by more than the square of the improvement in latency
  (capacity improves much faster than the cube of latency improvement; disk: 2500X vs. 8X)
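The disk figures above can be checked against the rule of thumb; this quick sketch uses only the numbers the slide gives (2500X capacity, 143X bandwidth, an 8X cube-of-latency gain, i.e. roughly a 2X latency improvement):

```python
# Rule of thumb: bandwidth improves by more than the square of the
# latency improvement, and capacity by more than its cube.
# Disk figures from the slide: capacity 2500X, bandwidth 143X, and a
# cube-of-latency gain of 8X, so latency improved about 2X.

latency_gain = 8 ** (1 / 3)    # cube root of 8, about 2.0
bandwidth_gain = 143
capacity_gain = 2500

print(f"latency improved {latency_gain:.1f}X")
print(f"bandwidth {bandwidth_gain}X > latency^2 = {latency_gain ** 2:.0f}X:",
      bandwidth_gain > latency_gain ** 2)
print(f"capacity {capacity_gain}X > latency^3 = {latency_gain ** 3:.0f}X:",
      capacity_gain > latency_gain ** 3)
```

Both inequalities hold by a wide margin for disks, which is why the slides urge designs that hide latency (caching, prefetching, overlap) rather than wait for it to improve.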