ACA April2011 Solved
April 2011

Explain the Dynamic scheduling mechanism with an example

Dynamic scheduling is used to handle cases in which dependences are unknown at compile time.

The hardware rearranges the instruction execution to reduce stalls while maintaining data

flow and exception behavior. It also allows code that was compiled with one pipeline in mind to run

efficiently on a different pipeline.

For example, consider this code:

DIV.D F0,F2,F4

ADD.D F10,F0,F8

SUB.D F12,F8,F14

The SUB.D instruction cannot execute because the dependence of ADD.D on DIV.D causes the pipeline

to stall; yet SUB.D is not data dependent on anything in the pipeline. This hazard creates a performance

limitation that can be eliminated by not requiring instructions to execute in program order.

To allow us to begin executing the SUB.D in the above example, we must separate the issue process into

two parts: checking for any structural hazards and waiting for the absence of a data hazard. We can still

check for structural hazards when we issue the instruction; thus, we still use in-order instruction issue

(i.e., instructions issue in program order), but we want an instruction to begin execution as soon as its

data operands are available. Thus, this pipeline does out-of-order execution, which implies out-of-order

completion.

To allow out-of-order execution, we essentially split the ID pipe stage of our simple five-stage pipeline

into two stages:

• Issue—Decode instructions, check for structural hazards.

• Read operands—Wait until no data hazards, then read operands.

In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order

issue); however, they can be stalled or bypass each other in the second stage (read operands) and thus

enter execution out of order. Scoreboarding is a technique for allowing instructions to execute out-of-

order when there are sufficient resources and no data dependences; it is named after the CDC 6600

scoreboard, which developed this capability.
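To make the mechanism concrete, here is a minimal Python sketch (a toy model of the idea, not the actual CDC 6600 scoreboard; the latencies are invented, and WAR/WAW hazards, which a real scoreboard also tracks, are ignored) showing how in-order issue with deferred operand reads lets SUB.D start and finish before ADD.D:

from collections import namedtuple

Instr = namedtuple("Instr", "op dest srcs latency")

program = [
    Instr("DIV.D", "F0",  ("F2", "F4"),  10),
    Instr("ADD.D", "F10", ("F0", "F8"),   2),  # RAW on F0: must wait for DIV.D
    Instr("SUB.D", "F12", ("F8", "F14"),  2),  # independent: may start early
]

pending = {}  # destination register -> cycle its result becomes available
for issue_cycle, instr in enumerate(program):  # in-order issue, one per cycle
    # read operands only once every source register is no longer pending
    start = max([pending.get(s, issue_cycle) for s in instr.srcs] + [issue_cycle])
    finish = start + instr.latency
    pending[instr.dest] = finish
    print(f"{instr.op}: issued cycle {issue_cycle}, executes {start}-{finish}")
# SUB.D finishes before ADD.D: out-of-order execution and completion.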

What is Flynn’s Taxonomy?

The idea of using multiple processors both to increase performance and to improve availability

was proposed by Flynn [1966] using a simple model of categorizing all computers. He looked at

the parallelism in the instruction and data streams called for by the instructions at the most


constrained component of the multiprocessor, and placed all computers into one of four

categories:

1. Single instruction stream, single data stream (SISD)—This category is the uniprocessor.

2. Single instruction stream, multiple data streams (SIMD)—The same instruction is executed by multiple processors using different data streams. SIMD computers exploit data-level parallelism

by applying the same operations to multiple items of data in parallel. Each processor has its

own data memory (hence multiple data), but there is a single instruction memory and control

processor, which fetches and dispatches instructions. For applications that display significant

data-level parallelism, the SIMD approach can be very efficient. The multimedia extensions

discussed in Appendices B and C are a form of SIMD parallelism. Vector architectures are the

largest class of SIMD architectures.

3. Multiple instruction streams, single data stream (MISD)—No commercial multiprocessor of 

this type has been built to date.

4. Multiple instruction streams, multiple data streams (MIMD)—Each processor fetches its own

instructions and operates on its own data. MIMD computers exploit thread-level parallelism,

since multiple threads operate in parallel. In general, thread-level parallelism is more flexible

than data-level parallelism and thus more generally applicable.
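As an illustration of the SIMD idea (my own sketch, not part of the source), NumPy's vectorized operations apply one operation to many data items at once, in contrast to an element-at-a-time SISD loop:

import numpy as np

a = np.arange(8, dtype=np.float64)
b = np.full(8, 2.0)

# SISD style: one instruction operates on one pair of data items at a time
c_scalar = np.empty(8)
for i in range(8):
    c_scalar[i] = a[i] + b[i]

# SIMD style: a single vector operation is applied to all elements at once
c_vector = a + b
assert (c_scalar == c_vector).all()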

Explain the shared memory architectures?

Existing MIMD multiprocessors fall into two classes, depending on the number of processors

involved, which in turn dictates a memory organization and interconnect strategy. Shared-memory architectures are categorized into centralized shared-memory and distributed shared-

memory.

Centralized-Shared Memory – For multi-processors with small processor counts, it is possible

for the processors to share a single centralized memory. With large caches, a single memory,

possibly with multiple banks, can satisfy the memory demands of a small number of processors.

By using multiple point-to-point connections, or a switch, and adding additional memory banks,

a centralized shared-memory design can be scaled to a few dozen processors. Because there is

a single main memory that has a symmetric relationship to all processors and a uniform access

time from any processor, these multiprocessors are most often called symmetric (shared-

memory) multiprocessors (SMPs), and this style of architecture is sometimes called uniform

memory access (UMA), arising from the fact that all processors have a uniform latency from

memory, even if the memory is organized into multiple banks.


Distributed-Shared Memory – It consists of multiprocessors with physically distributed

memory. To support larger processor counts, memory must be distributed among the

processors rather than centralized; otherwise the memory system would not be able to support the bandwidth demands of a larger number of processors without incurring excessively

long access latency.

Both direct networks (typically multi-dimensional meshes) and indirect networks (i.e., switches)

are used to interconnect large numbers of processors. Distributing the memory among

the nodes has two major benefits.

Advantages

• It is a cost-effective way to scale the memory bandwidth if most of the accesses are to

the local memory in the node.

• It reduces the latency for accesses to the local memory. These two advantages make

distributed memory attractive at smaller processor counts as processors get ever faster

and require more memory bandwidth and lower memory latency.

Disadvantages

The disadvantages of a distributed-memory architecture are that communicating data between processors becomes

somewhat more complex, and that it requires more effort in the software to take advantage of 

the increased memory bandwidth afforded by distributed memories.

Differentiate between shared memory and message passing architectures

Shared memory


• The physically separate memories can be addressed as one logically shared address

space.

• These multiprocessors are called distributed shared-memory (DSM) architectures.

• The same physical address on two processors refers to the same location in memory.

• DSM multiprocessors are also called NUMAs (non-uniform memory access), since the

access time depends on the location of a data word in memory.

Message Passing

• The address space can consist of multiple private address spaces that are logically disjoint and

cannot be addressed by a remote processor.

• In such multiprocessors, the same physical address on two different processors refers to two

different locations in two different memories.

• Each processor-memory module is essentially a separate computer.

• In a multiprocessor with multiple address spaces, communication of data is done by explicitly

passing messages among the processors. Therefore, these multiprocessors are often called

message-passing multiprocessors.

• Clusters inherently use message passing.
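As a small illustration (my own sketch, not from the source), Python's multiprocessing module mimics the message-passing model: the two processes have disjoint address spaces and exchange data only through explicit send/receive operations:

from multiprocessing import Process, Pipe

def worker(conn):
    # This process has its own private address space; data arrives
    # only as explicit messages over the channel.
    data = conn.recv()
    conn.send(sum(data))
    conn.close()

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send([1, 2, 3, 4])   # explicit send
    print(parent_end.recv())        # explicit receive: prints 10
    p.join()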

Write short notes on shared Virtual Memory and Static Interconnection Networks

Shared Virtual memory

A global virtual address space is shared among processors residing at a large number of loosely coupled

processing nodes. The idea is to implement coherent shared memory on a network of processors

without physically shared memory. Each virtual address space can be as large as a single node can provide and is

shared by all nodes in the system. The SVM address space is organized in pages and can be accessed by

any node in the system. A memory mapping manager on each node views its local memory as a large

cache of pages for its associated processors.

Static Interconnection Networks

• Linear Array

o N nodes connected by N-1 links (not a bus). Segments between different pairs of nodes

can be used in parallel

o Internal nodes have degree of 2. End nodes have degree of 1


o For small N this is economical; for large N it is inappropriate

• Ring and Chordal Ring

o Like a linear array, but the two end nodes are connected by an Nth link; the ring can be uni- or bi-

directional.

o By adding additional links the node degree is increased and we obtain a chordal ring.

o In the limit, we obtain a fully connected network, with a node degree of N-1 and a diameter

of 1

• Barrel Shifter

o Like a ring, but with additional links between all pairs of nodes that have a distance equal

to a power of 2

o With a network of size N = 2^n, each node has degree d = 2n-1, and the network has diameter D = n/2

o Barrel shifter connectivity is greater than any chordal ring of lower node degree.

o Barrel shifter much less complex than fully interconnected network.

• Tree and Star

o A k-level completely balanced binary tree will have N = 2^k - 1 nodes, with a maximum node

degree of 3 and a network diameter of 2(k-1).

o The balanced binary tree is scalable, since it has a constant maximum node degree.

o A star is a two-level tree with a node degree d = N-1 and a constant diameter of 2

• Fat Tree

o A fat tree is a tree in which the number of edges between nodes increases closer to the root.

o The edges represent communication channels, and since communication traffic

increases as the root is approached, it seems logical to increase the number of channels

there.

• Mesh and Torus

o Pure Mesh – N = n^k nodes with links between each adjacent pair of nodes in a row or

column. This is not a symmetric network; interior node degree d = 2k, diameter = k(n-1)

o Illiac Mesh – Wraparound is allowed, thus reducing the network diameter to about half 

that of the equivalent pure mesh.


o A torus has ring connections in each dimension, and is symmetric. An n x n binary torus

has a node degree of 4 and a diameter of 2⌊n/2⌋
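The degree and diameter formulas above tabulate directly; a quick sketch of my own (parameter values are arbitrary examples):

def linear_array(N):    return 2, N - 1            # internal degree, diameter
def ring(N):            return 2, N // 2           # bidirectional ring
def barrel_shifter(n):  return 2 * n - 1, n // 2   # network size N = 2^n
def binary_tree(k):     return 3, 2 * (k - 1)      # N = 2^k - 1 nodes
def star(N):            return N - 1, 2
def torus(n):           return 4, 2 * (n // 2)     # n x n torus

for name, (d, D) in [("linear array, N=16",  linear_array(16)),
                     ("ring, N=16",          ring(16)),
                     ("barrel shifter, n=4", barrel_shifter(4)),
                     ("binary tree, k=4",    binary_tree(4)),
                     ("star, N=16",          star(16)),
                     ("torus, 4x4",          torus(4))]:
    print(f"{name:22s} degree = {d:2d}, diameter = {D}")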

Draw a cacheline and bus state diagram of MESI cache coherence protocol.

Explain with example?

The MESI protocol is a widely used cache coherency and memory coherence protocol. It is the most

common protocol that supports write-back caches. Its use in personal computers became widespread

with the introduction of Intel's Pentium processor to "support the more efficient write-back cache in

addition to the write-through cache previously used by the Intel 486 processor".

Every cache line is marked with one of the four following states:

Modified
The cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state.

Exclusive
The cache line is present only in the current cache, but is clean; it matches main memory. It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may be changed to the Modified state when writing to it.

Shared
Indicates that this cache line may be stored in other caches of the machine and is "clean"; it matches the main memory. The line may be discarded (changed to the Invalid state) at any time.

Invalid
Indicates that this cache line is invalid.

For any given pair of caches, the permitted states of a given cache line are as follows (the table was lost in extraction; this is the standard MESI compatibility matrix):

        M    E    S    I
M       no   no   no   yes
E       no   no   no   yes
S       no   no   yes  yes
I       yes  yes  yes  yes

Operation


In a typical system, several caches share a common bus to main memory. Each also has an attached CPU which issues read and write requests. The caches' collective goal is to minimize the use of the shared main memory.

A cache may satisfy a read from any state except Invalid. An Invalid line must be fetched (to the Shared or Exclusive states) to satisfy a read.

A write may only be performed if the cache line is in the Modified or Exclusive state. If it is in the Shared state, all other cached copies must be invalidated first. This is typically done by a broadcast operation known as Request For Ownership (RFO).

A cache may discard a non-Modified line at any time, changing to the Invalid state. A Modified line must

be written back first.

A cache that holds a line in the Modified state must snoop (intercept) all attempted reads (from all of the other caches in the system) of the corresponding main memory location and insert the data that it holds. This is typically done by forcing the read to back off (i.e. retry later), then writing the data to main

memory and changing the cache line to the Shared state.

A cache that holds a line in the Shared state must listen for invalidate or request-for-ownership

broadcasts from other caches, and discard the line (by moving it into Invalid state) on a match.

A cache that holds a line in the Exclusive state must also snoop all read transactions from all other caches, and move the line to the Shared state on a match.

The Modified and Exclusive states are always precise: i.e. they match the true cache line ownership situation in the system. The Shared state may be imprecise: if another cache discards a Shared line, this cache may become the sole owner of that cache line, but it will not be promoted to the Exclusive state.

Other caches do not broadcast notices when they discard cache lines, and this cache could not use such notifications without maintaining a count of the number of shared copies.

In that sense the Exclusive state is an opportunistic optimization: if the CPU wants to modify a cache line that is in state S, a bus transaction is necessary to invalidate all other cached copies. State E enables modifying a cache line with no bus transaction.
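The transitions described above can be summarized in a short Python sketch (a simplified model of my own; real implementations add transient states, and the helper names are invented):

M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

def on_cpu(state, op, others_have_copy):
    """Local processor read/write; returns (new_state, bus_transaction)."""
    if op == "read":
        if state == I:                      # read miss: fetch the line
            return (S if others_have_copy else E), "BusRd"
        return state, None                  # M/E/S satisfy the read locally
    else:                                   # op == "write"
        if state in (M, E):                 # E -> M with no bus transaction
            return M, None
        return M, "BusRdX (RFO)"            # S or I: invalidate other copies

def on_snoop(state, bus_op):
    """Reaction to a transaction from another cache; returns (new_state, action)."""
    if bus_op == "BusRd":
        if state == M:                      # back off the reader, write back, share
            return S, "write back + supply data"
        return (S if state == E else state), None
    else:                                   # "BusRdX (RFO)": lose the line
        return I, ("write back" if state == M else None)

# Example of the opportunistic optimization: a line loaded exclusively
# can then be written with no bus traffic at all.
state, bus = on_cpu(I, "read", others_have_copy=False)       # -> Exclusive
state, bus = on_cpu(state, "write", others_have_copy=False)  # -> Modified, None
print(state, bus)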

What are instruction dependencies? Explain how these dependencies are resolved

using the pipeline stall mechanism in a 3-stage pipeline?

(1) Data Hazards

(i) Caused by data (RAW, WAW, WAR) dependences

(ii) Require

(A) Pipeline interlock (stall) mechanism to detect dependences and generate machine stall

cycles

(i) Reg-id comparators between instrs in REG stage and instrs in EXE/WRB stages

(iii) Stalls due to RAW hazards can be reduced by a bypass network

(A) Reg-id comparators + data bypass paths + mux


(2) Structural Hazards

(i) Caused by resource constraints

(ii) Require pipeline stall mechanism to detect structural constraints

(3) Control (Branch) Hazards

(i) Caused by branches

(ii) Instruction fetch of the next instruction has to wait until the target (including the branch

condition) of the current branch instruction is resolved

(iii) Use

(A) Pipeline stall to delay the fetch of the next instruction

(B) Predict the next target address (branch prediction) and if wrong, flush all the speculatively

fetched instructions from the pipeline
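A toy interlock model (my own sketch; the register names, single-cycle stages, and program are assumptions) of how reg-id comparison in the REG stage inserts stall cycles for a RAW hazard, which a bypass network would remove:

program = [
    ("ADD", "r1", ("r2", "r3")),
    ("SUB", "r4", ("r1", "r5")),   # RAW on r1: the interlock inserts a stall
    ("OR",  "r6", ("r7", "r8")),   # independent: no stall
]

write_cycle = {}   # register -> cycle in which its new value is written back
reg_free = 0       # cycle after which the REG stage is next available
for op, dest, srcs in program:
    # reg-id comparison: operands may be read only after every producer has
    # written back (a bypass network would remove the "+ 1" and the stall)
    earliest = max([write_cycle.get(s, 0) + 1 for s in srcs] + [reg_free + 1])
    stalls = earliest - (reg_free + 1)
    reg_free = earliest                 # cycle spent reading operands in REG
    write_cycle[dest] = earliest + 1    # result written in EXE/WRB next cycle
    print(f"{op}: reads operands in cycle {earliest} after {stalls} stall(s)")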

Define Amdahl’s law, CPI and Execution time

The performance gain that can be obtained by improving some portion of a computer can be calculated

using Amdahl’s Law. Amdahl’s Law states that the performance improvement to be gained from using

some faster mode of execution is limited by the fraction of the time the faster mode can be used.

Amdahl’s Law defines the speedup that can be gained by using a particular feature.

Speedup is the Ratio

Speedup= Performance for entire task using enhancement when possible / Performance for entire task

without using enhancement

Speedup= Execution Time for entire task without using enhancement / Execution Time for entire task

using enhancement when possible

Speedup tells us how much faster a task will run using the machine with the enhancement as opposed

to the original machine.

CPI – Is the average number of clock cycles per instruction, calculated when we know the number of clock

cycles and the instruction count.

CPI is computed as

CPI = CPU clock cycles per program / Instruction count

Execution Time – If we know the Instruction path length (IC), CPI (clock cycles per instruction) and

clock cycle time then execution time is defined as

CPU Time = Instruction count X Clock cycle time X Cycles per instruction
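A worked example applying the two definitions above (all numbers invented for illustration):

instruction_count = 2_000_000           # IC
cpu_clock_cycles  = 3_000_000           # total clock cycles for the program
clock_cycle_time  = 0.5e-9              # seconds per cycle (a 2 GHz clock)

cpi = cpu_clock_cycles / instruction_count             # 1.5 cycles/instruction
cpu_time = instruction_count * cpi * clock_cycle_time  # 1.5e-3 seconds
print(f"CPI = {cpi}, CPU time = {cpu_time * 1e3} ms")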

Explain prefix computation and loop unrolling method with example


Loop Unrolling

A simple scheme for increasing the number of instructions relative to the branch and overhead

instructions is loop unrolling. Unrolling simply replicates the loop body multiple times, adjusting the loop

termination code.

Loop unrolling can also be used to improve scheduling. Because it eliminates the branch, it allows

instructions from different iterations to be scheduled together.

Example: Show our loop unrolled so that there are four copies of the loop body, assuming R1 - R2 (that

is, the size of the array) is initially a multiple of 32, which means that the number of loop iterations is a

multiple of 4. Eliminate any obviously redundant computations and do not reuse any of the registers.

Answer: Here is the result after merging the DADDUI instructions and dropping the unnecessary BNE

operations that are duplicated during unrolling. Note that R2 must now be set so that 32(R2) is the

starting address of the last four elements.

Loop: L.D F0,0(R1)

ADD.D F4,F0,F2

S.D F4,0(R1) ;drop DADDUI & BNE

L.D F6,-8(R1)

ADD.D F8,F6,F2

S.D F8,-8(R1) ;drop DADDUI & BNE

L.D F10,-16(R1)

ADD.D F12,F10,F2

S.D F12,-16(R1) ;drop DADDUI & BNE

L.D F14,-24(R1)

ADD.D F16,F14,F2

S.D F16,-24(R1)

DADDUI R1,R1,#-32

BNE R1,R2,Loop

We have eliminated three branches and three decrements of R1. The addresses on the loads and stores

have been compensated to allow the DADDUI instructions on R1 to be merged.
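The same transformation at source level (my own illustration in Python; the assembly above adds the scalar in F2 to each element of the array):

def add_scalar(x, s):
    # rolled loop: one decrement and one branch of loop overhead per element
    i = len(x) - 1
    while i >= 0:
        x[i] += s
        i -= 1

def add_scalar_unrolled4(x, s):
    # unrolled by four (assumes len(x) is a multiple of 4, as in the example):
    # one decrement and one branch per four elements
    i = len(x) - 1
    while i >= 0:
        x[i]     += s
        x[i - 1] += s
        x[i - 2] += s
        x[i - 3] += s
        i -= 4

data = [1.0, 2.0, 3.0, 4.0]
add_scalar_unrolled4(data, 2.0)
print(data)   # [3.0, 4.0, 5.0, 6.0]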

Explain Hardware and Software Parallelism


September 2010

Write a neat functional structure of SIMD array processor with concurrent scalar

processor in control unit

Functional Structure of SIMD Array Processor

Functional structure of MIMD Array Processor


Define Clock rate, CPI, MIPS rate and throughput rate

Clock Rate

CPU is driven by a clock with a constant cycle time. Cycle time is represented using τ, in nanoseconds

• Inverse of cycle time is the clock rate

f = 1/τ, in megahertz

• Size of a program is determined by its instruction count (Ic), in terms of the number of machine

instructions to be executed in the program

CPI – Is the average number of clock cycles per instruction, calculated when we know the number of clock

cycles and the instruction count.

CPI is computed as

CPI = CPU clock cycles per program / Instruction count

MIPS:

One alternative to time as the metric is MIPS (Million Instructions Per Second)

MIPS = Instruction count / (Execution time x 10^6).

This MIPS measurement is also called Native MIPS to distinguish it from some alternative definitions of 

MIPS.

MIPS Rate: The rate at which instructions are executed at a given time.
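A quick worked example of the native MIPS formula (numbers invented):

instruction_count = 50_000_000   # instructions executed by the program
execution_time    = 0.04         # seconds

mips = instruction_count / (execution_time * 1_000_000)
print(mips)   # 1250.0 MIPS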


Throughput: The total amount of work done in a given time.

Throughput rate: The rate at which the total amount of work is done in a given time.

Define Amdahl’s law and explain

The performance gain that can be obtained by improving some portion of a computer can be calculated

using Amdahl’s Law. Amdahl’s Law states that the performance improvement to be gained from using

some faster mode of execution is limited by the fraction of the time the faster mode can be used.

Amdahl’s Law defines the speedup that can be gained by using a particular feature.

Speedup is the Ratio

Speedup=Performance for entire task using enhancement when possible / Performance for entire task

without using enhancement

Alternatively,

Speedup=Execution Time for entire task without using enhancement / Execution Time for entire task

using enhancement when possible

Speedup tells us how much faster a task will run using the machine with the enhancement as opposed

to the original machine. Amdahl’s Law gives us a quick way to find the speedup from some

enhancement, which depends on two factors:

1. The fraction of the computation time in the original machine that can be converted to take advantage

of the enhancement—For example, if 20 seconds of the execution time of a program that takes 60

seconds in total can use an enhancement, the fraction is 20/60.

This value, which we will call Fraction enhanced, is always less than or equal to 1.

2. The improvement gained by the enhanced execution mode; that is, how much faster the task would

run if the enhanced mode were used for the entire program— This value is the time of the original mode

over the time of the enhanced mode: If the enhanced mode takes 2 seconds for some portion of the

program that can completely use the mode, while the original mode took 5 seconds for the same

portion, the improvement is 5/2. We will call this value, which is always greater than 1,

Speedup enhanced.

The execution time using the original machine with the enhanced mode will be the time spent using the

unenhanced portion of the machine plus the time spent using the enhancement:

Execution time new = Execution time old x ( (1 - Fraction enhanced) + Fraction enhanced / Speedup enhanced )

The overall speedup is the ratio of the two execution times:

Speedup overall = 1 / ( (1 - Fraction enhanced) + Fraction enhanced / Speedup enhanced )


Amdahl’s Law can serve as a guide to how much an enhancement will improve performance and how to

distribute resources to improve cost/performance.
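Plugging the two factors from the example above (a 20/60 fraction and a 5/2 enhanced-mode speedup) into the law:

fraction_enhanced = 20 / 60   # portion of time that can use the enhancement
speedup_enhanced  = 5 / 2     # how much faster that portion becomes

new_time_fraction = (1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
speedup_overall = 1 / new_time_fraction
print(speedup_overall)   # 1.25: the whole program runs 25% faster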


Write the VAX 8600 CISC processor Architecture

The VAX 8600 was introduced by Digital Equipment Corporation in 1985. This machine implements a

typical CISC architecture with microprogrammed control. The instruction set contains about 300 instructions with 20 different addressing modes. As shown in Figure 5.2.3(a), the VAX 8600

executes the same instruction set, runs the same VMS operating system, and interfaces with the same I/

O buses (such as SBI and Unibus) as the VAX 11/780

The CPU in the VAX 8600 consists of two functional units for concurrent execution of integer and

floating-point instructions. The unified cache is used for holding both instructions and data. There are 16

GPRs in the instruction unit. Instruction pipelining has been built with six stages in the VAX 8600, as in

most CISC machines. The instruction unit prefetches and decodes instructions, handles branching

operations, and supplies operands to the two functional units in a pipelined fashion.

A translation lookaside buffer (TLB) is used in the memory control unit for fast generation of a physical

address from a virtual address. Both integer and floating-point units are pipelined. The CPI of a VAX

8600 instruction varies within a wide range from 2 cycles to as high as 20 cycles. For example, both

multiply and divide may tie up the execution unit for a large number of cycles. This is caused by the use of long sequences of microinstructions to control hardware operations.

What are reservation tables in pipelining? Mention their advantages?

• A reservation table displays the time-space flow of data through the pipeline for one function

evaluation


• A static pipeline is specified by a single reservation table

• A dynamic pipeline may be specified by multiple reservation tables

• The number of columns in a reservation table is called the evaluation time of a given function.

• The checkmarks in a row correspond to the time instants (cycles) that a particular stage will be used.

• Multiple checkmarks in a row indicate repeated usage of the same stage in different cycles

• Contiguous checkmarks indicate extended usage of a stage over more than one cycle

• Multiple checkmarks in one column indicate that multiple stages are used in parallel

• A dynamic pipeline may allow different initiations to follow a mix of reservation tables
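A small sketch (my own; the table contents are invented) that stores a reservation table as a stage-by-cycle grid, reads off the evaluation time from the number of columns, and derives the forbidden latencies (distances between checkmarks in the same row, at which a second initiation would collide in that stage):

table = [
    "X....X",   # stage S1: used in cycles 1 and 6 (multiple checkmarks in a row)
    ".XX...",   # stage S2: contiguous checkmarks, cycles 2-3
    "...XX.",   # stage S3
]

evaluation_time = len(table[0])   # number of columns

forbidden = set()
for row in table:
    marks = [i for i, c in enumerate(row) if c == "X"]
    forbidden |= {b - a for a in marks for b in marks if b > a}

print(evaluation_time, sorted(forbidden))   # 6 [1, 5]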