Transcript
Pipelining
Advanced Computer Architecture
Pipeline Design
• Instruction Pipeline Design
– Instruction Execution Phases
– Mechanisms for Instruction Pipelining
– Dynamic Instruction Scheduling
– Branch Handling Techniques
• Arithmetic Pipeline Design
– Computer Arithmetic Principles
– Static Arithmetic Pipelines
– Multifunctional Arithmetic Pipelines
Typical Instruction Pipeline
• A typical instruction execution includes a sequence of operations:
– Instruction Fetch (F)
– Decode (D)
– Operand Fetch or Issue (I)
– Execute, possibly several stages (E)
– Write Back (W)
Source: Kai Hwang
Instruction Execution Phases
• Each operation (F, D, I, E, W) may require one clock cycle or more. Ideally, these operations should be overlapped.
• Example (assumptions):
– load and store instructions take four cycles
– add and multiply instructions take three cycles
Shaded regions indicate idle cycles due to dependencies
Source: Kai Hwang
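The benefit of overlapping the phases above can be sketched in Python. This is a minimal sketch: the one-instruction-per-cycle issue rate and the absence of dependency stalls are simplifying assumptions, not claims from the source.

```python
def total_cycles(latencies):
    """Cycles needed sequentially vs. with ideal pipelined overlap."""
    sequential = sum(latencies)  # no overlap at all
    # With full overlap, instruction i enters the pipeline at cycle i,
    # so it finishes at cycle i + latency_i.
    overlapped = max(i + lat for i, lat in enumerate(latencies))
    return sequential, overlapped

# Load (4), add (3), multiply (3), store (4) cycles, as assumed above
seq, ovl = total_cycles([4, 3, 3, 4])
```

For these four instructions, overlapping cuts the total from 14 cycles to 7.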
Mechanisms for Instruction Pipelining
• Goal: Achieve maximum parallelism in the pipeline by smoothing the instruction flow and minimizing idle cycles
• Mechanisms:
– Prefetch Buffers
– Multiple Functional Units
– Internal Data Forwarding
– Hazard Avoidance
Prefetch Buffers
• Used to match the instruction fetch rate to the pipeline consumption rate
• In a single memory access, a block of consecutive instructions is fetched into a prefetch buffer
• Three types of prefetch buffers:
– Sequential buffers, used to store sequential instructions
– Target buffers, used to store branch target instructions
– Loop buffers, used to store loop instructions
Source: Kai Hwang
Multiple Functional Units
• At times, a specific pipeline stage becomes the bottleneck
• This is identified by a large number of check marks in a row of the reservation table
• To resolve dependencies, we use reservation stations (RS)
• Each RS is uniquely identified by a tag monitored by a tag unit (register tagging)
• This helps in conflict resolution and serves as buffering
Source: Kai Hwang
Internal Data Forwarding
• Goal: Replace memory access operations with register transfer operations
• Types:
– Store–load forwarding
– Load–load forwarding
– Store–store forwarding
Source: Kai Hwang
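Store–load forwarding can be sketched as a small rewrite pass. The tuple encoding of instructions below is hypothetical and purely illustrative, not from the source.

```python
def forward_store_load(seq):
    """Rewrite a load whose address was just written by a store into a
    register move (store-load forwarding). Instructions are encoded as
    hypothetical tuples: ("store", addr, src) and ("load", dst, addr)."""
    out = []
    last_store = {}  # address -> register most recently stored there
    for op in seq:
        if op[0] == "store":
            _, addr, src = op
            last_store[addr] = src
            out.append(op)
        elif op[0] == "load" and op[2] in last_store:
            # Replace the memory access with a register transfer
            out.append(("move", op[1], last_store[op[2]]))
        else:
            out.append(op)
    return out
```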
Hazard Avoidance
• Reads/writes of shared variables by different instructions in the pipeline may lead to different results if instructions are executed out of order
• Types:
– Read after Write (RAW) Hazard
– Write after Write (WAW) Hazard
– Write after Read (WAR) Hazard
Source: Kai Hwang
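The three hazard types can be detected from the read and write sets of an earlier instruction i and a later instruction j. The set-based encoding below is a minimal sketch, an assumption for illustration.

```python
def classify_hazards(i_reads, i_writes, j_reads, j_writes):
    """Hazards between instruction i and a later instruction j,
    based on which registers each one reads and writes."""
    found = set()
    if i_writes & j_reads:
        found.add("RAW")   # j reads a value that i writes
    if i_writes & j_writes:
        found.add("WAW")   # both write the same location
    if i_reads & j_writes:
        found.add("WAR")   # j overwrites a value that i still reads
    return found

# i: R1 = R2 + R3 ; j: R2 = R1 * R4  -> RAW on R1 and WAR on R2
```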
Instruction Scheduling
• Aim: To schedule instructions through an instruction pipeline
• Types of instruction scheduling:
– Static Scheduling
• Supported by an optimizing compiler
– Dynamic Scheduling
• Achieved by Tomasulo’s register-tagging scheme
• Achieved by the scoreboarding scheme
Static Scheduling
• Data dependencies in a sequence of instructions create interlocked relationships
• Interlocking can be resolved by the compiler by increasing the separation between interlocked instructions
• Example:
Two independent load instructions can be moved ahead so that the spacing between them and the multiply instruction is increased.
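The compiler transformation above can be illustrated by measuring the separation between a producing load and the multiply that consumes it. The instruction strings below are hypothetical.

```python
def separation(schedule, producer, consumer):
    """Number of instructions between producer and consumer."""
    return schedule.index(consumer) - schedule.index(producer) - 1

# Before scheduling: the multiply immediately follows the loads it needs
before = ["load R1", "load R2", "mul R3,R1,R2", "add R4,R5,R6", "sub R7,R8,R9"]
# After scheduling: independent add/sub are moved between loads and multiply
after = ["load R1", "load R2", "add R4,R5,R6", "sub R7,R8,R9", "mul R3,R1,R2"]
```

The separation grows from 0 to 2 instructions, giving the loads time to complete before the multiply issues.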
Tomasulo’s Algorithm
• Hardware-dependent scheme
• Data operands are saved in a Reservation Station (RS) until dependencies get resolved
• Register tagging is used to allocate/deallocate registers
• All working registers are tagged
Source: Kai Hwang
Scoreboarding
• Multiple functional units appear in multiple execution pipelines. Parallel units allow instructions to execute out of order with respect to the original program sequence.
• The processor has instruction buffers; instructions are issued regardless of the availability of their operands.
• A centralized control unit called the scoreboard is used to keep track of unavailable operands for instructions stored in the buffers
Source: Kai Hwang
Branch Handling Techniques
• Pipeline performance is limited by presence of branch instructions in program
• Various branch strategies are applied to minimize performance degradation
• To evaluate a branch strategy, two approaches can be followed:
– Trace data approach
– Analytical approach
Branching Illustrated
• Ib: Branch Instruction
• Once a branch is taken, all instructions already in the pipeline after it are flushed
• Subsequently, all the instructions at the branch target are run
Source: Kai Hwang
Effect of Branching
• Nomenclature:
– Branch Taken: the action of fetching non-sequential (remote) instructions after a branch instruction
– Branch Target: the (remote) instruction to be executed after the branch is taken
– Delay Slot (b): the number of pipeline cycles consumed between branch taken and branch target
• In general, 0 <= b <= k-1, where k is the number of pipeline stages
Effect of Branching
• When a branch is taken, all instructions after the branch instruction become useless, the pipeline is flushed, and a number of cycles are lost
• Let Ib be the branch instruction; a taken branch causes all instructions from Ib+1 to Ib+k-1 to be drained from the pipeline
• Let p be the probability that an instruction is a branch instruction and q the probability that a branch is taken; the time penalty is then Tpenalty = pqnbt, where
n: number of instructions; b: number of pipeline cycles consumed (the delay slot); t: cycle time
• The effective execution time becomes Teff = kt + (n-1)t + pqnbt
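The penalty formula can be checked numerically. The parameter values in the test are illustrative, not from the source.

```python
def branch_penalty(n, t, p, q, b):
    """Tpenalty = pqnbt, as defined above."""
    return p * q * n * b * t

def effective_time(n, k, t, p, q, b):
    """Teff = kt + (n-1)t + pqnbt."""
    return k * t + (n - 1) * t + branch_penalty(n, t, p, q, b)
```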
Branch Prediction
• A branch can be predicted based on:
– Static Branch Strategy
• The probability of a branch for a particular branch type can be used to predict the branch
• The probability may be obtained by collecting the frequency of branches taken and branch types across a large number of program traces
– Dynamic Branch Strategy
• Uses limited recent branch history to predict whether or not a branch will be taken when it occurs next time
Branch Prediction Internals
• Branch prediction buffer
– Used to store the branch history information in order to make branch predictions
• State transition diagram used in dynamic branch prediction
Source: Kai Hwang
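A common realization of such a state transition diagram is a 2-bit saturating counter; the source does not fix the number of states, so the 4-state counter and its initial state below are assumptions.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken,
    states 2-3 predict taken."""
    def __init__(self, state=2):
        self.state = state  # start weakly taken (an arbitrary choice)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Step one state toward the observed outcome, saturating at 0 and 3
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)
```

The saturation means a single mispredicted branch does not immediately flip a strongly established prediction.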
Delayed Branches
• The branch penalty can be reduced by the concept of the delayed branch
• The central idea is to delay the execution of the branch instruction to accommodate independent* instructions
– Delaying by d cycles allows a few useful (independent*) instructions to be executed before the branch takes effect
* Execution of these instructions should be independent of the outcome of the branch instruction
Linear Pipeline Processors
• A linear pipeline processor is constructed with k processing stages i.e. S1 … Sk
• These stages are linearly connected to perform a specific function
• The data stream flows from one end of the pipeline to the other: external inputs are fed into S1, final results move out from Sk, and intermediate results pass from Si to Si+1
• Linear pipelining is applied to:
– Instruction execution
– Arithmetic computation
– Memory access operations
Asynchronous Model
• Data flow between adjacent stages is controlled by a handshaking protocol:
– When a stage Si is ready to transmit, it sends a ready signal to stage Si+1
– This is followed by the actual data transfer
– After stage Si+1 receives the data, it returns an acknowledge signal to Si
Source: Kai Hwang
Contd…
• Asynchronous pipelines are useful in designing communication channels in message passing multicomputers
• They have a variable throughput rate.
• Different amounts of delay may be experienced in different stages.
Synchronous Model
• Clocked latches are used to interface between stages
– Latches are master-slave flip-flops that isolate inputs from outputs. Upon arrival of a clock pulse, all latches transfer data to the next stage at the same time.
• Pipeline stages are combinational circuits.
Source: Kai Hwang
Contd…
• It is desirable to have equal delays in all stages.
• These delays determine the clock period and thus the speed of the pipeline.
Reservation Table
• It specifies the utilization pattern of successive stages in a synchronous pipeline
• It is a space-time graph depicting the precedence relationships in using the pipeline stages
• For a k-stage linear pipeline, k clock cycles are needed for data to flow through the pipeline
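For a linear pipeline the reservation table is simply a diagonal of marks, one stage per cycle. A minimal sketch using (stage, cycle) pairs:

```python
def linear_reservation_table(k):
    """(stage, cycle) marks for a k-stage linear pipeline:
    stage s is busy in cycle s, so the table is a diagonal."""
    return {(s, s) for s in range(k)}

def flow_through_cycles(table):
    """Cycles needed for one datum to traverse the pipeline."""
    return max(cycle for _, cycle in table) + 1
```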
Clocking and Timing Control
• Clock cycle and throughput:
– The clock cycle time (t) of a pipeline is given by
t = tm + d
where
tm denotes the maximum stage delay
d denotes the latch delay
– The pipeline frequency (f = 1/t) is referred to as the throughput of the pipeline
• Clock skewing:
– Ideally, clock pulses should arrive at all stages at the same time, but due to clock skewing, the same clock pulse may arrive at different stages with an offset of s
– Further, let tmax be the time delay of the longest logic path within a stage and tmin that of the shortest logic path within a stage; then
d + tmax + s <= t <= tm + tmin - s
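The bound on the clock period can be expressed directly. This is a minimal sketch; the sample values in the test are illustrative, not from the source.

```python
def clock_period_bounds(tm, tmax, tmin, d, s):
    """(lower, upper) bounds on the clock period t:
    d + tmax + s <= t <= tm + tmin - s."""
    return d + tmax + s, tm + tmin - s
```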
Speedup
• Case 1: Pipelined processor
– Ideally, the number of clock cycles required by a k-stage pipeline to process n tasks is Np = k + (n-1)
(k clock cycles for the first task and 1 clock cycle for each of the remaining n-1 tasks)
– The total time required is Tk = (k + (n-1))t
• Case 2: Non-pipelined processor
– A non-pipelined processor would take time T1 = nkt
• Speedup Factor: Sk of a k-stage pipeline over an equivalent non-pipelined processor is:
Sk = T1 / Tk = nkt / ((k + (n-1))t) = nk / (k + n - 1)
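The speedup expression is easy to evaluate; note that Sk approaches k as n grows large.

```python
def speedup(n, k):
    """Sk = nk / (k + n - 1) for n tasks on a k-stage pipeline."""
    return n * k / (k + n - 1)

# For n = 64 tasks on a 4-stage pipeline: Sk = 256/67, roughly 3.82
```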
Optimal Number of Stages
• Most pipelining is staged at the functional level with 2 ≤ k ≤ 15.
• Very few pipelines are designed to exceed 10 stages in real computers.
• The optimal number of pipeline stages should maximize the performance/cost ratio (PCR) for the target processing load.
• PCR = f / (c + kh) = 1 / ((t/k + d)(c + kh))
where f = 1/(t/k + d)
• Maximizing the PCR gives the optimal number of pipeline stages: k0 = √(tc/dh), where t is the total flow-through delay of the pipeline, c is the total stage cost, d is the latch delay, and h is the latch cost.
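The optimal stage count follows directly from the formula. The sample values in the test are illustrative, not from the source.

```python
import math

def optimal_stages(t, c, d, h):
    """k0 = sqrt(t*c / (d*h)): total flow-through delay t,
    total stage cost c, latch delay d, latch cost h."""
    return math.sqrt(t * c / (d * h))
```

In practice k0 would be rounded to a nearby integer and checked against the PCR at both neighbors.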
Efficiency & Throughput
• Efficiency: defined as the speedup factor divided by the number of stages:
Ek = Sk / k = n / (k + (n-1))
• Pipeline Throughput: defined as the number of tasks completed per unit time:
Hk = n / ((k + (n-1))t) = nf / (k + (n-1))
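Both measures can be evaluated directly from the definitions above; the sample values in the test are illustrative.

```python
def efficiency(n, k):
    """Ek = Sk / k = n / (k + n - 1)."""
    return n / (k + n - 1)

def throughput(n, k, t):
    """Hk = n / ((k + n - 1) t), in tasks per unit time."""
    return n / ((k + n - 1) * t)
```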