Solution Manual for Modern Processor Design by John Paul Shen and Mikko H. Lipasti
This book emerged from the course Superscalar Processor Design, which has been taught at
Carnegie Mellon University since 1995. Superscalar Processor Design is a mezzanine course targeting
seniors and first-year graduate students. Quite a few of the more aggressive juniors have taken the
course in the spring semester of their junior year. The prerequisite to this course is the
Introduction to Computer Architecture course. The objectives for the Superscalar Processor Design
course include: (1) to teach modern processor design skills at the microarchitecture level of
abstraction; (2) to cover current microarchitecture techniques for achieving high performance
via the exploitation of instruction-level parallelism (ILP); and (3) to impart insights and hands-on
experience for the effective design of contemporary high-performance microprocessors for mobile,
desktop, and server markets. In addition to covering the contents of this book, the course contains
a project component that involves the microarchitectural design of a future-generation superscalar
microprocessor.
Here, in successive posts, I am going to post solutions for the same textbook (Modern
Processor Design by John Paul Shen and Mikko H. Lipasti). If you find any difficulty or want to
suggest anything, feel free to comment. Link: http://targetiesnow.blogspot.in/p/solution-manual-for-modern-processor.html
Ex 1.8, 1.9 and 1.10 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti :
Q.1.8: Recent processors like the Pentium 4 do not implement single-cycle shifts.
Given the scenario of Problem 7, assume that s = 50% of the additional integer and shift
instructions introduced by strength reduction are shifts, and shifts now take
four cycles to execute. Recompute the cycles per instruction and overall program speedup. Is
strength reduction still a good optimization?
Q.1.9: Given the assumptions of Problem 8, solve for the break-even ratio s (percentage of
additional instructions that are shifts). That is, find the value of s (if any) for which program
performance is identical to the baseline case without strength reduction (Problem 6).
Q.1.10: Given the assumptions of Problem 8, assume you are designing the shift unit on the
Pentium 4 processor. You have concluded there are two possible implementation options for the
shift unit: 4-cycle shift latency at a frequency of 2 GHz, or 2-cycle shift latency at 1.9 GHz. Assume
the rest of the pipeline could run at 2 GHz, and hence the 2-cycle shifter would set the entire
processor’s frequency to 1.9 GHz. Which option will provide better overall performance?
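A quick way to frame Q.1.10 is to rank the two designs by execution time per instruction, T = CPI / f. The CPI values in this sketch are placeholders, not the results of Problem 1.8 (whose baseline data is not reproduced here); only the comparison method is the point:

```python
# Hedged sketch for Q.1.10: compare shifter options by time per instruction,
# T = CPI / f. The CPI numbers below are ASSUMED for illustration only.
def time_per_instr(cpi, freq_hz):
    """Seconds per instruction for a given CPI and clock frequency."""
    return cpi / freq_hz

opt_4cyc = time_per_instr(cpi=1.60, freq_hz=2.0e9)  # 4-cycle shift @ 2 GHz (assumed CPI)
opt_2cyc = time_per_instr(cpi=1.55, freq_hz=1.9e9)  # 2-cycle shift @ 1.9 GHz (assumed CPI)

# Smaller time per instruction wins; with these placeholder CPIs the
# higher-frequency option comes out ahead.
print("4-cycle @ 2 GHz" if opt_4cyc < opt_2cyc else "2-cycle @ 1.9 GHz")
```

With the actual CPIs from Problem 1.8 the same two-line comparison settles the question.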
Ex. 3.13 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution Manual
Q.3.13: Consider a processor with 32-bit virtual addresses, 4KB pages and 36-bit physical
addresses. Assume memory is byte-addressable (i.e. the 32-bit VA specifies a byte in memory).
L1 instruction cache: 64 Kbytes, 128 byte blocks, 4-way set associative, indexed and tagged with
virtual address.
L1 data cache: 32 Kbytes, 64 byte blocks, 2-way set associative, indexed and tagged with physical
address, write-back.
4-way set associative TLB with 128 entries in all. Assume the TLB keeps a dirty bit, a reference bit,
and 3 permission bits (read, write, execute) for each entry.
Specify the number of offset, index, and tag bits for each of these structures in the table below.
Also, compute the total size in number of bit cells for each of the tag and data arrays.
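The table itself is not reproduced here, but the field widths follow mechanically from the given geometries: each array has size / (block size × ways) sets, the block (or page) size fixes the offset bits, and the tag is whatever remains of the indexing address. A small sketch:

```python
import math

def fields(size_bytes, block_bytes, ways, addr_bits):
    """Return (offset, index, tag) bit counts for a set-associative array."""
    sets = size_bytes // (block_bytes * ways)
    offset = int(math.log2(block_bytes))
    index = int(math.log2(sets))
    return offset, index, addr_bits - index - offset

l1i = fields(64 * 1024, 128, 4, 32)  # L1I: virtually indexed and tagged
l1d = fields(32 * 1024, 64, 2, 36)   # L1D: physically indexed and tagged

# TLB: 4 KB pages -> 12-bit page offset, 20-bit VPN; 128 entries, 4-way
tlb_index = int(math.log2(128 // 4))
tlb_tag = (32 - 12) - tlb_index

print(l1i, l1d, tlb_index, tlb_tag)  # (7, 7, 18) (6, 8, 22) 5 15
```

The bit-cell totals then follow by multiplying entries by (tag + status bits) and (data bits) per entry; the status-bit counts come from the problem statement.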
Q.3.16: Assume a two-level cache hierarchy with a private level one
instruction cache (L1I), a private level one data cache (L1D), and a shared
level two data cache (L2). Given local miss rates of 4% for L1I, 7.5%
for L1D, and 35% for L2, compute the global miss rate for the L2 cache.
Q.3.17: Assuming 1 L1I access per instruction and 0.4 data accesses
per instruction, compute the misses per instruction for the L1I, L1D, and L2 caches of Problem 16.
Q.3.18: Given the miss rates of Problem 16, and assuming that
accesses to the L1I and L1D caches take one cycle, accesses to the L2 take 12 cycles, accesses to
main memory take 75 cycles, and a clock rate of 1 GHz, compute the average memory reference latency for this cache
hierarchy.
Q.3.19: Assuming a perfect cache CPI (cycles per instruction) for a
pipelined processor equal to 1.15 CPI, compute the MCPI and overall CPI
for a pipelined processor with the memory hierarchy described in Problem 18 and the miss rates and access rates specified in Problem 16 and Problem 17.
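Under one common interpretation of Q.3.16 through Q.3.19 (the L2 is accessed only on an L1 miss, and the L2 and memory penalties are added on top of the 1-cycle L1 access), the numbers can be worked out as below. Treat this as a sketch of the method, not the book's official answer:

```python
# Hedged worked sketch for Q.3.16-3.19 under one common interpretation.
l1i_miss, l1d_miss, l2_miss = 0.04, 0.075, 0.35  # local miss rates
refs_per_instr = 1.0 + 0.4                       # 1 L1I + 0.4 L1D accesses

l2_accesses = 1.0 * l1i_miss + 0.4 * l1d_miss    # L2 accesses per instruction
l2_misses = l2_accesses * l2_miss                # L2 misses per instruction
global_l2_miss_rate = l2_misses / refs_per_instr # Q.3.16

# Q.3.18: 1-cycle L1, +12 cycles on an L1 miss, +75 on an L2 miss, 1 GHz
avg_latency_cycles = (1
    + (l2_accesses / refs_per_instr) * 12
    + (l2_misses / refs_per_instr) * 75)

mcpi = l2_accesses * 12 + l2_misses * 75         # Q.3.19 memory CPI
cpi = 1.15 + mcpi                                # perfect-cache CPI + MCPI
print(global_l2_miss_rate, avg_latency_cycles, mcpi, cpi)
```

At 1 GHz one cycle is 1 ns, so the average reference latency in cycles doubles as a latency in nanoseconds.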
Ex 4.8 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution manual
Q.4.8: In an in-order pipelined processor, pipeline latches are used to hold result operands
from the time an execution unit computes them until they are written back to the register file
during the writeback stage. In an out-of-order processor, rename registers are used for the same
purpose. Given a four-wide out-of-order processor TYP
pipeline, compute the minimum number of rename registers needed
to prevent rename register starvation from limiting concurrency. What happens to this number if
frequency demands force a designer to add five extra pipeline stages between
dispatch and execute, and five more stages between execute and retire/writeback?
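One back-of-envelope way to reason about Q.4.8 (an assumption-laden sketch, not the book's derivation): a rename register can be held by every in-flight instruction from dispatch until retirement, so a machine that dispatches `width` instructions per cycle may need roughly width × (stages between dispatch and retire) registers. The stage count below is a placeholder, not the TYP pipeline's actual depth:

```python
# Rough capacity model; the dispatch-to-retire depth is ASSUMED.
width = 4  # four-wide machine, per the problem

def min_rename_regs(stages_dispatch_to_retire):
    """Rename registers needed if every in-flight result holds one."""
    return width * stages_dispatch_to_retire

base = min_rename_regs(4)         # placeholder depth for the baseline pipeline
deeper = min_rename_regs(4 + 10)  # +5 stages dispatch->execute, +5 execute->retire
print(base, deeper)
```

Whatever the baseline depth, adding ten stages between dispatch and retire grows the requirement by 10 × width = 40 registers under this model.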
Ex 5.1 and 5.2 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution manual
Q.5.1: The displayed code that follows steps through the elements of two arrays (A[] and B[])
concurrently, and for each element, it puts the larger of the two values into the corresponding
element of a third array (C[]). The three arrays are of length N.
The instruction set used for Problems 5.1 through 5.6 is as follows:
Ex 5.7 through 5.13 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution manual
Q.5.7 through Q.5.13: Consider the following code segment within a loop body:

if (x is even) then (branch b1)
    increment a (b1 taken)
if (x is a multiple of 10) then (branch b2)
    increment b (b2 taken)
Assume that the following list of 9 values of x is to be processed by 9 iterations of this loop.
8, 9, 10, 11, 12, 20, 29, 30, 31
Note: assume that predictor entries are updated by each dynamic branch before the next dynamic branch accesses the predictor (i.e., there is no update delay).
Q.5.7: Assume that a one-bit (history bit) state machine (see above) is used as the
prediction algorithm for predicting the execution of the two branches in this loop. Indicate the
predicted and actual branch directions of the b1 and b2 branch instructions for each iteration of
this loop. Assume an initial state of 0, i.e., NT, for the predictor.
Q.5.8: What are the prediction accuracies for b1 and b2?
Q.5.9: What is the overall prediction accuracy?
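The one-bit predictor of Q.5.7 through Q.5.9 is easy to check by simulation. This sketch assumes both branches execute on every iteration and that each branch has its own predictor entry, initialized to not-taken:

```python
# Hedged sketch: 1-bit (last-outcome) predictors for Q.5.7-5.9.
xs = [8, 9, 10, 11, 12, 20, 29, 30, 31]

def simulate(outcomes):
    """Run a 1-bit predictor over a taken/not-taken trace; return hits."""
    state, hits = 0, 0  # state 0 = predict not-taken, 1 = predict taken
    for taken in outcomes:
        if (state == 1) == taken:
            hits += 1
        state = 1 if taken else 0  # remember the last outcome
    return hits

b1 = [x % 2 == 0 for x in xs]   # b1 taken when x is even
b2 = [x % 10 == 0 for x in xs]  # b2 taken when x is a multiple of 10

print(simulate(b1), simulate(b2))  # correct predictions out of 9 each
```

Under these assumptions b1 is predicted correctly only once in nine tries, since it alternates almost every iteration; the overall accuracy is the two hit counts summed over the 18 dynamic branches.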
Q.5.10: Assume a two-level branch prediction scheme is used. In addition to the one-bit
predictor, a one-bit global register (g) is used. Register g stores the direction of the last branch
executed (which may not be the same branch as the branch currently being
predicted) and is used to index into two separate one-bit branch history tables (BHTs) as
shown below. Depending on the value of g, one of the two BHTs is selected and used to
do the normal one-bit prediction. Again, fill in the predicted and actual branch directions
of b1 and b2 for nine iterations of the loop. Assume the initial value of g = 0, i.e., NT.
For each prediction, depending on the current value of g, only one of the two BHTs is
accessed and updated. Hence, some of the entries below should be empty.
Note: assume that predictor entries are updated by each dynamic branch before the next
dynamic branch accesses the predictor (i.e. there is no update delay).
Q.5.11: What are the prediction accuracies for b1 and b2?
Q.5.12: What is the overall prediction accuracy?
Q.5.13: What is the prediction accuracy of b2 when g = 0? Explain why.
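The two-level scheme of Q.5.10 through Q.5.13 can be simulated the same way. This sketch assumes b1 and b2 both execute every iteration (b1 first), every dynamic branch updates g, and all predictor state starts at 0 (NT):

```python
# Hedged sketch: g-selected pair of 1-bit BHT entries per branch (Q.5.10-5.13).
xs = [8, 9, 10, 11, 12, 20, 29, 30, 31]

bht = {"b1": [0, 0], "b2": [0, 0]}  # one 1-bit entry per value of g
g = 0                               # global register, initially NT
hits = {"b1": 0, "b2": 0}

for x in xs:
    for name, taken in (("b1", x % 2 == 0), ("b2", x % 10 == 0)):
        pred = bht[name][g]                 # g selects which BHT is used
        if (pred == 1) == taken:
            hits[name] += 1
        bht[name][g] = 1 if taken else 0    # update only the selected entry
        g = 1 if taken else 0               # every branch updates g

print(hits)
```

The g = 0 case of Q.5.13 falls out of the simulation: g = 0 when b2 is predicted means b1 was just not taken, so x is odd and b2 cannot be taken either, and the selected entry never leaves the not-taken state.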
Exercise 5.14 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution manual
Q.5.14: Below is the control flow graph of a simple program. The CFG is annotated with three
different execution trace paths. For each execution trace, circle which branch predictor (bimodal,
local, or Gselect) will best predict the branching behavior of the given trace. More than one
predictor may perform equally well on a particular trace; however, you are to use each of the
three predictors exactly once in choosing the best predictors for the three traces. Circle your
choice for each of the three traces. (Assume each trace is executed many times and every node
in the CFG is a conditional branch. The branch history register for the local, global, and Gselect
predictors is limited to 4 bits.)
Ex 7.10, 7.11, 7.12, 11.1, 11.8 & 11.10 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution Manual
Q.7.10: If the P6 microarchitecture had to support an instruction set that included
predication, what effect would that have on the register renaming process?
Q.7.11: As described in the text, the P6 microarchitecture splits store operations into a STA
and STD pair for handling address generation and data movement. Explain why this makes
sense from a microarchitectural implementation perspective.
Q.7.12: Following up on Problem 7.11, would there be a performance benefit (measured in
instructions per cycle) if stores were not split? Explain why or why not.
Q.11.1: Using the syntax in Figure 11-2, show how to use the load-linked/store
conditional primitives to synthesize a compare-and-swap operation.
Q.11.8: Real coherence controllers include numerous transient states in addition to the
ones shown in the figure to support split-transaction buses. For example, when a processor issues
a bus read for an invalid line (I), the line is placed in an IS transient state until the processor has
received a valid data response that then causes the line to transition into the shared state (S).
Given a split-transaction bus that separates each bus command (bus read, bus write, and bus
upgrade) into a request and a response, augment the state table and state transition diagram of
the figure to incorporate all necessary transient states and bus responses. For simplicity, assume
that any bus command for a line in a transient state gets a negative acknowledge (NAK) response
that forces it to be retried after some delay.
Q.11.10: Assuming a processor frequency of 1 GHz, a target CPI of 2, a level-2 cache miss rate of
1% per instruction, a snoop-based cache coherent system with 32 processors, and 8-byte address
messages (including command and snoop addresses), compute the inbound and outbound snoop
bandwidth required at each processor node.
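One common reading of Q.11.10: each L2 miss generates one outbound 8-byte snoop message, and every node must absorb the snoops issued by the other 31 processors. A sketch under that assumption:

```python
# Hedged sketch for Q.11.10: snoop bandwidth per node, assuming every L2
# miss broadcasts one 8-byte message to all other processors.
freq_hz = 1e9
cpi = 2.0
miss_per_instr = 0.01
msg_bytes = 8
n_proc = 32

instr_per_sec = freq_hz / cpi                    # instructions retired per second
misses_per_sec = instr_per_sec * miss_per_instr  # snoop messages issued
outbound = misses_per_sec * msg_bytes            # bytes/s leaving each node
inbound = outbound * (n_proc - 1)                # snoops arriving from other nodes

print(outbound / 1e6, inbound / 1e6)  # 40.0 MB/s out, 1240.0 MB/s in
```

The asymmetry is the point of the exercise: inbound snoop bandwidth scales with the processor count, which is why snooping systems stop scaling.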