April 2011
Explain the Dynamic scheduling mechanism with an example
Dynamic scheduling is used to handle cases where dependences are unknown at compile time. The hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior. It also allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline. For example, consider this code:
DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14
The SUB.D instruction cannot execute because the dependence of ADD.D on DIV.D causes the pipeline to stall; yet SUB.D is not data dependent on anything in the pipeline. This hazard creates a performance limitation that can be eliminated by not requiring instructions to execute in program order.
To allow the SUB.D in the example above to begin executing, we must separate the issue process into two parts: checking for any structural hazards and waiting for the absence of a data hazard. We can still check for structural hazards when we issue the instruction; thus, we still use in-order instruction issue (i.e., instructions issue in program order), but we want an instruction to begin execution as soon as its data operands are available. Thus, this pipeline does out-of-order execution, which implies out-of-order completion.
To allow out-of-order execution, we essentially split the ID pipe stage of the simple five-stage pipeline into two stages:
• Issue—Decode instructions, check for structural hazards.
• Read operands—Wait until no data hazards, then read operands.
In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue); however, they can be stalled or bypass each other in the second stage (read operands) and thus enter execution out of order.
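The two-stage issue described above can be sketched as a toy simulation (illustrative only; the register latencies and helper names below are made up, and WAR/WAW hazards, which a real scoreboard must also track, are ignored). It shows SUB.D beginning execution long before ADD.D:

```python
# Minimal sketch of in-order issue + out-of-order execution: an instruction
# begins executing as soon as all of its source operands are ready, even if
# an earlier instruction is still stalled waiting for a producer.

def schedule(instructions):
    """instructions: list of (name, dest_reg, src_regs, latency_in_cycles)."""
    pending = {}          # dest register -> cycle at which its value is ready
    order = []            # (name, cycle at which execution begins)
    issue_cycle = 0
    for name, dest, srcs, latency in instructions:
        issue_cycle += 1  # in-order issue: one instruction enters per cycle
        # read-operands stage: wait until every source register is ready
        start = max([issue_cycle] + [pending.get(r, 0) for r in srcs])
        pending[dest] = start + latency
        order.append((name, start))
    return sorted(order, key=lambda x: x[1])

code = [
    ("DIV.D F0,F2,F4",   "F0",  ["F2", "F4"], 24),  # long-latency divide
    ("ADD.D F10,F0,F8",  "F10", ["F0", "F8"],  4),  # RAW-dependent on DIV.D
    ("SUB.D F12,F8,F14", "F12", ["F8", "F14"], 4),  # independent of both
]
for name, start in schedule(code):
    print(f"cycle {start:2d}: {name} begins execution")
```

With these assumed latencies, SUB.D starts at cycle 3 while ADD.D must wait until the divide's result is ready, illustrating out-of-order execution despite in-order issue.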
Scoreboarding is a technique for allowing instructions to execute out of order when there are sufficient resources and no data dependences; it is named after the CDC 6600 scoreboard, which developed this capability.
What is Flynn’s Taxonomy?
The idea of using multiple processors both to increase performance and to improve availability was proposed by Flynn [1966], using a simple model for categorizing all computers. He looked at the parallelism in the instruction and data streams called for by the instructions at the most
constrained component of the multiprocessor, and placed all computers into one of four
categories:
1. Single instruction stream, single data stream (SISD)—This category is the uniprocessor.
2. Single instruction stream, multiple data streams (SIMD)—The same instruction is executed by multiple processors using different data streams. SIMD computers exploit data-level parallelism
by applying the same operations to multiple items of data in parallel. Each processor has its
own data memory (hence multiple data), but there is a single instruction memory and control
processor, which fetches and dispatches instructions. For applications that display significant
data-level parallelism, the SIMD approach can be very efficient. The multimedia extensions
discussed in Appendices B and C are a form of SIMD parallelism. Vector architectures are the
largest class of SIMD architectures.
3. Multiple instruction streams, single data stream (MISD)—No commercial multiprocessor of
this type has been built to date.
4. Multiple instruction streams, multiple data streams (MIMD)—Each processor fetches its own
instructions and operates on its own data. MIMD computers exploit thread-level parallelism,
since multiple threads operate in parallel. In general, thread-level parallelism is more flexible
than data-level parallelism and thus more generally applicable.
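The SIMD idea above can be made concrete with a toy example (the helper name and data are illustrative, not from the text): a single operation is broadcast across several data streams in lockstep, instead of one instruction operating on one data item at a time as in SISD.

```python
# Toy model of SIMD: one "control processor" applies the same instruction
# (here, an add) to every data stream; each stream stands in for the private
# data memory of one SIMD processing element.
data_streams = [[1, 2], [3, 4], [5, 6], [7, 8]]

def simd_apply(op, streams):
    """Apply the same operation to every data stream in lockstep."""
    return [op(*s) for s in streams]

result = simd_apply(lambda x, y: x + y, data_streams)
print(result)  # one logical instruction produced all four sums
```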
Explain the shared memory architectures?
Existing MIMD multiprocessors fall into two classes, depending on the number of processors
involved, which in turn dictates a memory organization and interconnect strategy. Shared-memory architectures are categorized into centralized-shared memory and distributed-shared
memory.
Centralized-Shared Memory – For multi-processors with small processor counts, it is possible
for the processors to share a single centralized memory. With large caches, a single memory,
possibly with multiple banks, can satisfy the memory demands of a small number of processors.
By using multiple point-to-point connections, or a switch, and adding additional memory banks,
a centralized shared-memory design can be scaled to a few dozen processors. Because there is
a single main memory that has a symmetric relationship to all processors and a uniform access
time from any processor, these multiprocessors are most often called symmetric (shared-
memory) multiprocessors (SMPs), and this style of architecture is sometimes called uniform
memory access (UMA), arising from the fact that all processors have a uniform latency from
memory, even if the memory is organized into multiple banks.
Distributed-Shared Memory – It consists of multiprocessors with physically distributed
memory. To support larger processor counts, memory must be distributed among the
processors rather than centralized; otherwise the memory system would not be able to support the bandwidth demands of a larger number of processors without incurring excessively
long access latency.
Both direct networks (i.e., switches) and indirect networks (typically multi-dimensional
meshes) are used to interconnect large numbers of processors. Distributing the memory among
the nodes has two major benefits.
Advantages
• It is a cost-effective way to scale the memory bandwidth if most of the accesses are to
the local memory in the node.
• It reduces the latency for accesses to the local memory. These two advantages make
distributed memory attractive at smaller processor counts as processors get ever faster
and require more memory bandwidth and lower memory latency.
Disadvantages
The disadvantages of a distributed-memory architecture are that communicating data between processors becomes
somewhat more complex, and that it requires more effort in the software to take advantage of
the increased memory bandwidth afforded by distributed memories.
Differentiate between shared memory and message passing architectures
• A torus has ring connections in each dimension, and is symmetric. An n x n binary torus
has a node degree of 4 and a diameter of 2⌊n/2⌋.
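The stated diameter can be verified with a short script (illustrative; the helper name is made up). Along one ring of size n the longest shortest path is ⌊n/2⌋ hops, and the worst case travels that far in both dimensions:

```python
# Verify that an n x n torus has diameter 2 * floor(n/2): by symmetry it is
# enough to measure shortest paths from node (0, 0) to every other node.
def torus_diameter(n):
    # shortest hop count along one ring of size n from position 0 to position i
    ring = lambda i: min(i, n - i)
    return max(ring(i) + ring(j) for i in range(n) for j in range(n))

for n in (3, 4, 5, 8):
    assert torus_diameter(n) == 2 * (n // 2)
print("diameter matches 2*floor(n/2) for all tested sizes")
```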
Draw a cacheline and bus state diagram of MESI cache coherence protocol.
Explain with example?
The MESI protocol is a widely used cache coherency and memory coherence protocol. It is the most
common protocol that supports write-back caches. Its use in personal computers became widespread
with the introduction of Intel's Pentium processor to "support the more efficient write-back cache in
addition to the write-through cache previously used by the Intel 486 processor."
Every cache line is marked with one of the four following states:
Modified – The cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state.
Exclusive – The cache line is present only in the current cache, but is clean; it matches main memory. It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may be changed to the Modified state when writing to it.
Shared – Indicates that this cache line may be stored in other caches of the machine and is "clean"; it matches the main memory. The line may be discarded (changed to the Invalid state) at any time.
Invalid – Indicates that this cache line is invalid.
For any given pair of caches, the permitted states of a given cache line are as follows (a dash marks a combination that cannot occur, since Modified and Exclusive lines exist in only one cache):

        M    E    S    I
   M    -    -    -    ok
   E    -    -    -    ok
   S    -    -    ok   ok
   I    ok   ok   ok   ok
In a typical system, several caches share a common bus to main memory. Each also has an attached CPU which issues read and write requests. The caches' collective goal is to minimize the use of the shared main memory.
A cache may satisfy a read from any state except Invalid. An Invalid line must be fetched (to the Shared or Exclusive state) to satisfy a read.
A write may only be performed if the cache line is in the Modified or Exclusive state. If it is in the Shared state, all other cached copies must be invalidated first. This is typically done by a broadcast operation known as Request For Ownership (RFO).
A cache may discard a non-Modified line at any time, changing to the Invalid state. A Modified line must
be written back first.
A cache that holds a line in the Modified state must snoop (intercept) all attempted reads (from all of the other caches in the system) of the corresponding main memory location and insert the data that it holds. This is typically done by forcing the read to back off (i.e. retry later), then writing the data to main
memory and changing the cache line to the Shared state.
A cache that holds a line in the Shared state must listen for invalidate or request-for-ownership
broadcasts from other caches, and discard the line (by moving it into the Invalid state) on a match.
A cache that holds a line in the Exclusive state must also snoop all read transactions from all other caches, and move the line to the Shared state on a match.
The Modified and Exclusive states are always precise: i.e. they match the true cache line ownership situation in the system. The Shared state may be imprecise: if another cache discards a Shared line, this cache may become the sole owner of that cache line, but it will not be promoted to the Exclusive state.
Other caches do not broadcast notices when they discard cache lines, and this cache could not use such notifications without maintaining a count of the number of shared copies.
In that sense the Exclusive state is an opportunistic optimization: if the CPU wants to modify a cache line that is in state S, a bus transaction is necessary to invalidate all other cached copies. State E enables modifying a cache line with no bus transaction.
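The transitions described above can be sketched as a small state machine (a simplified, illustrative model, not a full protocol implementation: the event names are made up, write-back data movement is not modeled, and an Invalid-line fill is assumed to find no other sharers so it goes to Exclusive):

```python
# Simplified MESI transitions for a single cache line, driven by local CPU
# operations and snooped bus events.
TRANSITIONS = {
    # (current_state, event) -> next_state
    ("I", "local_read"):  "E",   # fetch; assumed no other copies (else "S")
    ("I", "local_write"): "M",   # read-for-ownership, then modify
    ("E", "local_write"): "M",   # no bus transaction needed: the key MESI win
    ("E", "snoop_read"):  "S",   # another cache now shares the line
    ("S", "local_write"): "M",   # after broadcasting an RFO to invalidate others
    ("S", "snoop_rfo"):   "I",   # another cache claimed ownership
    ("M", "snoop_read"):  "S",   # write data back, then share
    ("M", "snoop_rfo"):   "I",   # write data back, then invalidate
}

def step(state, event):
    """Return the next state; events with no listed effect leave it unchanged."""
    return TRANSITIONS.get((state, event), state)

# Example: a CPU writes a Shared line (invalidating other copies), then a
# snooped read from another cache forces a write-back and downgrade.
s = "S"
s = step(s, "local_write")   # S -> M
s = step(s, "snoop_read")    # M -> S
print(s)                     # prints "S"
```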
What are instruction dependencies? Explain how these dependencies are resolved
using the pipeline stall mechanism in a 3-stage pipeline?
(1) Data Hazards
(i) Caused by data (RAW, WAW, WAR) dependences
(ii) Require a pipeline interlock (stall) mechanism to detect dependences and generate machine
stall cycles, using reg-id comparators between instructions in the REG stage and instructions in
the EXE/WRB stages
(iii) Stalls due to RAW hazards can be reduced by a bypass network
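The interlock above can be sketched in a few lines (illustrative; the function and register names are made up). Reg-id comparators match the source registers of the instruction in the REG stage against the destination registers in the EXE and WRB stages, and a bubble is inserted on a RAW match unless a bypass path forwards the result:

```python
# Reg-id comparator interlock for a 3-stage (REG / EXE / WRB) pipeline sketch.
def needs_stall(reg_stage, exe_stage, wrb_stage, bypass=False):
    """Each stage is None or a dict {'dest': reg, 'srcs': [regs]}."""
    if reg_stage is None:
        return False
    older = [s for s in (exe_stage, wrb_stage) if s is not None]
    # RAW hazard: a source of the younger instruction matches an older dest
    raw = any(s["dest"] in reg_stage["srcs"] for s in older)
    # a bypass network forwards EXE/WRB results straight to REG, so the
    # RAW dependence no longer forces a stall cycle
    return raw and not bypass

add = {"dest": "r3", "srcs": ["r1", "r2"]}   # ADD r3, r1, r2
sub = {"dest": "r5", "srcs": ["r3", "r4"]}   # SUB r5, r3, r4 (RAW on r3)

print(needs_stall(sub, add, None))               # True: interlock stalls
print(needs_stall(sub, add, None, bypass=True))  # False: result is forwarded
```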
The VAX 8600 was introduced by Digital Equipment Corporation in 1985. This machine implements a
typical CISC architecture with microprogrammed control. The instruction set contains about
300 instructions with 20 different addressing modes. As shown in Figure 5.2.3(a), the VAX 8600
executes the same instruction set, runs the same VMS operating system, and interfaces with the same I/
O buses (such as SBI and Unibus) as the VAX 11/780
The CPU in the VAX 8600 consists of two functional units for concurrent execution of integer and
floating-point instructions. The unified cache is used for holding both instructions and data. There are 16
GPRs in the instruction unit. Instruction pipelining has been built with six stages in the VAX 8600, as in
most CISC machines. The instruction unit prefetches and decodes instructions, handles branching
operations, and supplies operands to the two functional units in a pipelined fashion.
A translation lookaside buffer (TLB) is used in the memory control unit for fast generation of a physical
address from a virtual address. Both integer and floating-point units are pipelined. The CPI of a VAX
8600 instruction varies within a wide range, from 2 cycles to as high as 20 cycles. For example, both
multiply and divide may tie up the execution unit for a large number of cycles. This is caused by the use of long sequences of microinstructions to control hardware operations.
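The effect of such a wide CPI range can be shown with a short worked example (the instruction mix below is made up for illustration): overall CPI is the frequency-weighted sum of the per-class CPIs, so even a small fraction of long microcoded instructions raises the average noticeably.

```python
# Average CPI as a weighted sum over hypothetical instruction classes.
mix = [
    # (fraction of executed instructions, CPI for that class)
    (0.70,  2),   # simple instructions at the 2-cycle end of the range
    (0.25,  6),   # mid-range instructions
    (0.05, 20),   # long microcoded ops such as multiply/divide
]
avg_cpi = sum(frac * cpi for frac, cpi in mix)
print(avg_cpi)  # 0.70*2 + 0.25*6 + 0.05*20 = 3.9
```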
What are reservation tables in pipelining? Mention its advantages?
• A reservation table displays the time-space flow of data through the pipeline for one function evaluation. Its rows correspond to pipeline stages and its columns to clock cycles; a marked entry indicates that the stage is in use during that cycle.
• From the reservation table, the forbidden latencies can be read off, which allows successive initiations to be scheduled into the pipeline without stage collisions.
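A small example makes this concrete (the three-stage pipeline and its table below are made up for illustration). A latency k is forbidden if some stage is marked at two cycles that are k apart, because a second evaluation started k cycles later would collide there:

```python
# An illustrative reservation table: rows are stages, columns are clock
# cycles, True marks the stage being used by the function in that cycle.
table = [
    # cycle:  0     1      2      3
    [True,  False, False, True ],   # stage S1 used at cycles 0 and 3
    [False, True,  True,  False],   # stage S2 used at cycles 1 and 2
    [False, False, True,  False],   # stage S3 used at cycle 2 only
]

def forbidden_latencies(table):
    """Latencies at which a second initiation would collide in some stage."""
    forbidden = set()
    for row in table:
        used = [t for t, busy in enumerate(row) if busy]
        forbidden.update(j - i for i in used for j in used if j > i)
    return sorted(forbidden)

print(forbidden_latencies(table))  # distances between marks in the same row
```

Here stage S1 yields forbidden latency 3 and stage S2 yields forbidden latency 1; any other initiation interval is collision-free for this table.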