CHAPTER 1
Introduction: What is Pipelining?
Definition:
• In computing, a pipeline is a set of data processing elements connected in series, so that
the output of one element is the input of the next one.
• An instruction pipeline is a technique used in the design of computers and other digital
electronic devices to increase their instruction throughput (the number of instructions that
can be executed in a unit of time).
The fundamental idea is to split the processing of a computer instruction into a series of
independent steps, with storage at the end of each step. This allows the computer's control
circuitry to issue instructions at the processing rate of the slowest step, which is much faster than
the time needed to perform all steps at once. The term pipeline refers to the fact that each step is
carrying data at once (like water), and each step is connected to the next (like the links of a pipe).
Most modern CPUs are driven by a clock. The CPU consists internally of logic and registers
(flip-flops). When the clock signal arrives, the flip-flops take their new values and the logic then
requires a period of time to decode the new values. Then the next clock pulse arrives, the flip-flops
again take their new values, and so on. By breaking the logic into smaller pieces and inserting
flip-flops between the pieces of logic, the delay before the logic gives valid outputs is
reduced. In this way the clock period can be reduced. For example, the classic RISC pipeline is
broken into four stages with a set of flip-flops between each stage.
1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access & Register write back
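The overlap among the four stages above can be made concrete with a small timeline sketch. This is a minimal Python illustration, assuming an ideal pipeline with no stalls; the stage names and instruction labels are only illustrative.

```python
# Minimal sketch: occupancy of an ideal 4-stage pipeline, cycle by cycle.
# Instruction i (0-based) occupies stage s during cycle i + s, so n
# instructions need k + (n - 1) cycles in total on a k-stage pipeline.

STAGES = ["IF", "ID", "EX", "MEM/WB"]

def timeline(n):
    """Map each cycle to the list of (instruction, stage) pairs active in it."""
    sched = {}
    for i in range(n):
        for s, stage in enumerate(STAGES):
            sched.setdefault(i + s, []).append((f"I{i + 1}", stage))
    return sched

sched = timeline(3)
total_cycles = max(sched) + 1   # 4 + (3 - 1) = 6 cycles for 3 instructions
```

In cycle 2 of this schedule, I1 is executing while I2 is being decoded and I3 is being fetched, which is exactly the overlapped operation the text describes.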
When a programmer (or compiler) writes assembly code, they make the assumption that each
instruction is executed before execution of the subsequent instruction begins. This assumption
is invalidated by pipelining. When this causes a program to behave incorrectly, the situation is
known as a hazard. Various techniques for resolving hazards, such as forwarding and stalling,
exist.
A non-pipelined architecture is inefficient because some CPU components (modules) are idle
while another module is active during the instruction cycle. Pipelining does not completely
remove this idle time, but making the modules work in parallel improves program throughput
significantly.
Each sub-task is to be handled by a separate processing station / stage.
Time Overlapped Usage of Different Processing Stations in an Assembly Line
(General Observations)
4. Each of the processing stages should have some input to work on in order to keep that unit
busy as often as possible.
5. Each of the processing stages, except the very last one, generates some output to be
consumed by the next processing stage only.
6. Each of these processing stages may not take the same time and also need not be synchronized.
Hence there will have to be some intermediate store / buffer to hold temporarily the inputs to any
particular processing station.
7. Since each processing stage depends only on its predecessor stage and feeds only its
successor stage, the processing time for any particular task/job cannot be reduced below the
processing time of the slowest stage.
8. Each task normally passes through each of the processing stages regardless of the requirement.
9. Hence the time taken to process a single task may increase compared to the case where the
given task is processed based on its specific requirements, since a task may have to go through
some unnecessary stages.
10. System throughput, i.e. the number of tasks completed over a specific period of time, will
increase because of the time-overlapped operation of the various processing stages.
11. However, if during the course of processing any of the processing stages fails / stalls,
then the entire assembly line will either crash or get stalled.
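Observations 6, 7 and 10 can be checked with a small simulation of buffered stages with unequal processing times. The stage times below are made-up numbers, and the buffers between stages are assumed to be unbounded.

```python
# Sketch: jobs flowing through an assembly line of buffered stages.
# finish[j][s] is the time at which job j leaves stage s: a job starts a stage
# only when it has left the previous stage AND the stage is done with the
# previous job (the buffers of observation 6 make these the only constraints).

def completion_times(n_jobs, stage_times):
    k = len(stage_times)
    finish = [[0] * k for _ in range(n_jobs)]
    for j in range(n_jobs):
        for s, t in enumerate(stage_times):
            ready = finish[j][s - 1] if s > 0 else 0   # job has left prior stage
            free = finish[j - 1][s] if j > 0 else 0    # stage is free again
            finish[j][s] = max(ready, free) + t
    return finish

f = completion_times(4, [2, 5, 3])
# The first job takes 2 + 5 + 3 = 10 units; after that one job completes
# every 5 units -- the slowest stage sets the rate (observation 7).
```

The per-job latency never drops below the sum of the stage times, but completions arrive at the rate of the slowest stage, which is the throughput gain of observation 10.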
Using Pipeline Inside a Computer
(Salient Queries - 1)
1. How is this assembly line concept applicable to instruction processing in a typical
computer?
Ans. a) The CPU of any computer essentially fetches, decodes and then executes instructions
belonging to a program.
b) Each instruction's processing is composed of an almost identical set of stages / machine
cycles. Hence one can view the CPU as an assembly line for instruction processing.
Using Pipeline Inside a Computer
(Salient Queries - 2)
2. Is the improved throughput, i.e. the number of tasks completed over a period of time, dependent
on / proportional to the number of processing / pipeline stages?
To be answered later in the context of instruction processing in a computer.
The Typical Instruction Handling Sequence in a CPU
Typical Instruction Processing Stages Inside a CPU
1. Fetch the instruction op-code [CISC] / the entire instruction [RISC] from the instruction
cache / memory into the instruction register, using the instruction pointer / PC appended by the
code segment register, and update the instruction pointer / PC to point to the next instruction. [IF]
2. Decode the instruction op-code inside the CPU and select some register operands [RISC] (in
this case the instruction pointer / PC can be used to fetch the next instruction), or decide on
future operand address reads as well as the next instruction location, as in CISC. Update the PC
accordingly. [ID]
3. Read operand addresses into the instruction register from the I-cache using the instruction
memory address register [CISC only] [ROA]. This may have to be carried out a number of times,
once for each of the operand addresses. (Optional; not required for RISC.)
4. Execute the instruction's processing op-code / calculate the linear operand address offset using
the ALU [EX]. In the former case (processing), the operation may vary in time depending on the
type of operation being carried out.
5. Read operand values from the data cache / memory using the computed linear offset
obtained in the previous step, appended by the appropriate segment register
(DATA / STACK / EXTRA). [MEM]
N.B.: For CISC, the above two steps 4 & 5 may need to be executed a number of times, once
for reading each of the operand addresses and at least once for performing the computation. This
computation time need not be fixed.
6. Write back the result [into the designated destination] [WB]. In the case of memory being
the destination, the processor needs to compute the linear destination address (as in step 4).
7. Interrupt handling: the main issues here are two-fold, namely preserving the current context
on the system stack, followed by computing / locating the target and loading it into the
instruction pointer.
One can time overlap these operations provided:
A. There is no resource conflict among the various stages [no structural hazards].
B. Each instruction, once in the pipeline, in no way affects the execution pattern of any of its
successor instructions in the pipeline [there exists no inter-instruction dependency in the form
of data or control dependencies].
Instruction Level Parallelism (ILP)
It is a measure of how many of the instructions in a computer program can be executed
simultaneously [in a time-overlapped fashion] without violating the various inter-instruction
dependencies that may exist.
Consider the following program:
I#1. e = a + b
I#2. f = c + d
I#3. g = e * f
Instruction I#3 depends on the results of instruction I#1 as well as on instruction I#2 [true
(data) [RAW] dependency].
However, instructions I#1 and I#2 do not depend on any other instruction, so they can be
executed simultaneously.
If we assume that each Instruction can be completed in one unit of time then these three
instructions can be completed in a total of two units of time, giving an ILP of 3/2.
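The 3/2 figure can be computed mechanically: an instruction's earliest time step is one more than the deepest instruction it depends on, and ILP is the instruction count divided by the depth of that critical chain. This is a minimal sketch; the (destination, sources) tuple encoding is invented for illustration.

```python
# Sketch: ILP = (instruction count) / (longest RAW dependency chain length).
# Each instruction is represented as (destination, set of source operands).

def ilp(instrs):
    level = {}                        # earliest time step for each instruction
    for i, (_, srcs) in enumerate(instrs):
        deps = [level[j] for j in range(i) if instrs[j][0] in srcs]
        level[i] = 1 + max(deps, default=0)
    depth = max(level.values())       # length of the critical chain
    return depth, len(instrs) / depth

prog = [("e", {"a", "b"}),            # I#1. e = a + b
        ("f", {"c", "d"}),            # I#2. f = c + d
        ("g", {"e", "f"})]            # I#3. g = e * f
depth, ratio = ilp(prog)              # two time steps, ILP = 3/2
```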
Goal & Motivation to Achieve Speed-Up:
Ordinary programs are typically written under a sequential execution model where instructions
execute one after the other and in the order specified by the programmer.
ILP allows the compiler and the processor to overlap the execution of multiple instructions or
even to change the order in which instructions are executed.
A goal of compiler and processor designers is to identify and take advantage of as much ILP as
possible in a specified sequential code.
How much ILP exists in programs is very application specific. In certain fields, such as graphics
[manipulation of individual pixels in a group] and scientific computing [matrix multiplication],
the amount can be very large. However, workloads such as cryptography exhibit much less
parallelism because of the inherent RAW data dependency among the constituent operations.
Micro-Architectural Techniques Used to Exploit ILP
Instruction pipelining, where the execution of multiple instructions can be partially overlapped.
Superscalar execution, in which multiple execution units are used to execute multiple instructions
in parallel. In typical superscalar processors, the instructions executing simultaneously are
adjacent in the original program order.
Out-of-order execution, where instructions execute in any order that does not violate data
dependencies. Note that this technique is independent of both pipelining and superscalar
execution.
Register renaming, which refers to a technique used to avoid unnecessary serialization of
program operations imposed by the reuse of registers by those operations; used to enable out-of-
order execution.
Speculative execution, which allows the execution of complete instructions or parts of
instructions before it is certain whether this execution should take place.
A commonly used form of speculative execution is control flow speculation, where instructions
past a control flow instruction (e.g., a branch) are executed before the target of the control flow
instruction is determined [branch prediction (used to avoid stalling for control dependencies to
be resolved)].
Several other forms of speculative execution have been proposed and are in use, including
speculative execution driven by value prediction, memory dependence prediction and cache
latency prediction.
Factors Affecting ILP Implementation
Inter-instruction dependencies: data dependency & control dependency.
Various Types of Data Dependencies
A data dependency in computer science is a situation in which a program statement (instruction)
refers to the data / operand of a preceding statement / instruction in some way or the other.
In compiler theory, the technique used to discover data dependencies among statements (or
instructions) is called dependence analysis.
Data Dependency
Defn: Let us consider that in any computer program there are two statements S1 & S2, where
statement S1 precedes statement S2 in the program.
Statement S2 is said to be data dependent on statement S1 if any one of the following 3
cases exists.
Data Dependency Conditions
Bernstein Conditions: Assuming statements S1 and S2, S2 depends on S1 if
[O(S1) ∩ I(S2)] ∪ [I(S1) ∩ O(S2)] ∪ [O(S1) ∩ O(S2)] ≠ ∅
where I(Si) is the set of memory locations read by Si, O(Sj) is the set of memory locations
written by Sj, and there is a feasible run-time execution path from S1 to S2. This condition is
called the Bernstein condition, named after A. J. Bernstein.
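As a sketch, the condition can be checked directly on read/write sets; the helper function and the set encodings below are for illustration only.

```python
# Sketch: Bernstein's condition as the three set intersections.
# The arguments are the I() (read) and O() (write) sets of the two statements.

def depends(read1, write1, read2, write2):
    raw = bool(write1 & read2)    # O(S1) ∩ I(S2): true (flow) dependence
    war = bool(read1 & write2)    # I(S1) ∩ O(S2): anti-dependence
    waw = bool(write1 & write2)   # O(S1) ∩ O(S2): output dependence
    return raw or war or waw, (raw, war, waw)

# S1: e = a + b and S2: g = e * f share only a RAW dependence on e.
dep, kinds = depends({"a", "b"}, {"e"}, {"e", "f"}, {"g"})
```

When all three intersections are empty, the two statements satisfy Bernstein's condition for independence and may be reordered or executed in parallel.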
Cases of Data Dependency
True (data) dependence: O(S1) ∩ I(S2) ≠ ∅. Statement S1 precedes statement S2 and S1
writes into some place (memory / register) that will be read by the successor statement S2.
[Read After Write (RAW)]
Anti (name) dependence: I(S1) ∩ O(S2) ≠ ∅, the mirror relationship of true dependence:
here the predecessor instruction S1 reads from some memory location or register which is later
modified / written by the successor instruction S2. [Write After Read (WAR)]
Output dependence: O(S1) ∩ O(S2) ≠ ∅, S1 → S2 and both instructions S1 & S2 write to the
same memory location or register. [Write After Write (WAW)]
True Data [RAW] Dependency - 1
Statement S1 precedes statement S2 and S1 writes into some place (memory / register) that
will be read by the successor statement S2. [Read After Write (RAW)]
Example:
A true dependency, also known as a data dependency, occurs when an instruction depends on the
result of a previous instruction:
I#1. A = 3
I#2. B = A
I#3. C = B
True Data [RAW] Dependency - 2
Here Instruction I#3 is truly dependent on instruction I#2, as the final value of C depends on the
instruction updating B. Instruction I#2 is truly dependent on instruction I#1, as the final value of
B depends on the instruction updating A.
Since instruction I#3 is truly dependent upon instruction I#2 and instruction I#2 is truly
dependent on instruction I#1, instruction I#3 is also truly dependent on instruction I#1.
Instruction level parallelism is therefore not an option in this example.
Anti (Name) [WAR] Dependency
An anti-dependency occurs when an instruction requires a value that is later updated. In the
following example, instruction 3 anti-depends on instruction 2 — the ordering of these
instructions cannot be changed, nor can they be executed in parallel (possibly changing the
instruction ordering), as this would affect the final value of A.
I#1. B = 3
I#2. A = B + 1
I#3. B = 7
An anti-dependency is an example of a name dependency. That is, renaming of variables could
remove the dependency, as shown below:
Removing Anti Dependency through Renaming of Variables
I#1 . B = 3
I#N. B2 = B
I#2. A = B2 + 1
I#3. B = 7
Here a new variable, B2, has been declared as a copy of B in a new instruction, instruction N.
The anti-dependency between instruction I#2 and instruction I#3 has been removed,
meaning that these instructions may now be executed in parallel. However, the modification has
introduced new RAW dependencies: instruction I#2 is now truly dependent on instruction I#N,
which is in turn truly dependent upon instruction I#1. Being true dependencies, these new
dependencies cannot be safely removed.
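The effect of renaming can be sketched by detecting WAR pairs mechanically. Note that, for brevity, this sketch uses a slightly different but equivalent fix from the text above: it renames the destination of I#3 to a fresh name (B2) rather than inserting a copy instruction. The (destination, sources) encoding is illustrative.

```python
# Sketch: a WAR (anti) dependence disappears once the later write gets a
# fresh name. Instructions are (destination, set of source operands).

def war_pairs(instrs):
    """Return (i, j) pairs where a later instruction j writes a name read by i."""
    pairs = []
    for i, (_, srcs) in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            if instrs[j][0] in srcs:
                pairs.append((i, j))
    return pairs

before = [("B", set()),       # I#1. B = 3
          ("A", {"B"}),       # I#2. A = B + 1
          ("B", set())]       # I#3. B = 7
after = [("B", set()),        # I#1. B = 3
         ("A", {"B"}),        # I#2. A = B + 1
         ("B2", set())]       # I#3. B2 = 7  (destination renamed)

# war_pairs(before) reports the I#2 -> I#3 anti-dependence; after renaming
# it is gone, so I#2 and I#3 may be reordered or run in parallel.
```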
Output [WAW] Dependency
An output dependency occurs when the ordering of instructions will affect the final output value
of a variable. In the example used above (I#1. B = 3; I#2. A = B + 1; I#3. B = 7), there is an
output dependency between instructions I#3 and I#1: both write to B, so their order cannot be
exchanged without changing the final value of B.
Basic Pipelining Terminologies
Pipeline cycle: the time required to move an instruction one step further in the pipeline; not to
be confused with the clock cycle. It is determined by the time required by the slowest stage.
Pipeline designers try to balance the length (i.e. the processing time) of each pipeline stage.
For a perfectly balanced N-stage pipeline, the execution time per instruction is t/N,
where t is the execution time per instruction on the non-pipelined machine and N is the number
of pipeline stages.
However, it is very difficult to make the different pipeline stages perfectly balanced, so different
pipeline stages may possess different processing times.
Besides, pipelining itself involves some overhead arising due to the registers / latches used
between two successive pipeline stages.
Some Important Pipeline Issues
Timing Factors in a Typical Pipeline
Pipeline cycle τ: if the delay of stage m is τ_m and the inter-stage latch / register delay is d, then
τ = max{τ_m} + d
Pipeline frequency: f = 1 / τ
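The timing relations above can be evaluated numerically; all the delay figures in this sketch are made-up, illustrative numbers.

```python
# Sketch: pipeline cycle = slowest stage delay + latch delay; f = 1 / tau.

stage_delays_ns = [8.0, 10.0, 9.0, 7.0]   # per-stage logic delays (ns), illustrative
d_ns = 1.0                                # inter-stage latch/register delay (ns)

tau_ns = max(stage_delays_ns) + d_ns      # pipeline cycle: 11.0 ns
f_mhz = 1e3 / tau_ns                      # frequency: about 90.9 MHz
```

Note that the cycle is set by the slowest stage plus the latch overhead, so speeding up any other stage changes nothing.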
Ideal Pipeline Speedup
A k-stage pipeline processes n tasks in k + (n-1) clock cycles:
k cycles for the first task and n-1 cycles for the remaining n-1 tasks.
Total time to process n tasks:
Tk = [k + (n-1)] τ
For the non-pipelined processor:
T1 = n k τ [n tasks pass through k stages, each having delay τ]
Pipeline Speedup Expression
Speedup(Sk) = T1 / Tk = n k τ / [k + (n-1)] τ = n k / [k + (n-1)]
As n → ∞, Sk approaches k, the number of stages: this is the ideal speedup.
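The speedup expression is easy to check numerically; note that τ cancels out of the ratio.

```python
# Sketch: S_k = n*k / (k + n - 1) for a k-stage pipeline processing n tasks.

def speedup(k, n):
    return (n * k) / (k + n - 1)

s_one = speedup(4, 1)        # a single task gains nothing: 1.0
s_many = speedup(4, 1000)    # 4000 / 1003, just under the ceiling of k = 4
```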
Observe that the memory bandwidth must increase by a factor of Sk; otherwise, the processor
would stall waiting for data to arrive from memory.
Pipelining increases instruction throughput, but it does not decrease the execution time of the
individual instructions. In fact, it slightly increases the execution time of each instruction due to
pipeline overheads, since each instruction passes through identical pipeline stages.
Disadvantages of Pipelining:
1. A non-pipelined processor executes only a single instruction at a time. This prevents branch
delays (in effect, every branch is delayed) and problems with serial instructions being executed
concurrently. Consequently the design is simpler and cheaper to manufacture.
2. The instruction latency in a non-pipelined processor is slightly lower than in a pipelined
equivalent. This is because extra flip-flops must be added to the data path of a pipelined
processor.
3.A non-pipelined processor will have a stable instruction bandwidth. The performance of a
pipelined processor is much harder to predict and may vary more widely between different
programs.
Pipeline Overheads
Pipeline register delay: caused by the set-up time of the inter-stage registers.
Clock skew: the maximum delay between clock arrival at any two registers.
Once the clock cycle is as small as the pipeline overhead, no further pipelining would be useful;
hence very deep pipelines may not be useful.
EXAMPLES:
Four Stages of an Instruction:
Instruction Fetch (F): fetch the instruction from the instruction memory.
Operand Fetch and Instruction Decode (D): fetch the operand data from the memory or registers
& decode the instruction.
Execute (E): calculate the memory address and/or execute the function.
Memory & Write Back (M): read the data from the data memory & write back to the register.
When a "hiccup" in execution occurs, a "bubble" is created in the pipeline in which nothing
useful happens. In cycle 2, the fetching of the 'B' instruction is delayed, and the decoding stage
in cycle 3 now contains a bubble. Everything "behind" the 'B' instruction is delayed as well, but
everything "ahead" of the 'B' instruction continues with execution.
Clearly, when compared to the stall-free execution, the bubble yields a total execution time of 8
clock ticks instead of 7.
Bubbles are like stalls, in which nothing useful happens in the fetch, decode, execute and
writeback stages. A bubble can be implemented with a NOP (no operation) code.
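The 8-versus-7 count above can be reproduced with a tiny model, assuming four instructions on a four-stage pipeline and a one-cycle bubble ahead of the second instruction; the `finish_cycle` helper is illustrative.

```python
# Sketch: four instructions on a four-stage pipeline normally finish in
# 4 + (4 - 1) = 7 cycles; a one-cycle bubble ahead of the second instruction
# delays it and everything behind it by one cycle.

def finish_cycle(n, k, stalls):
    """stalls maps instruction index (0-based) to bubble cycles inserted before it."""
    delay, last = 0, 0
    for i in range(n):
        delay += stalls.get(i, 0)   # later instructions inherit every earlier delay
        last = i + delay + k        # cycle in which instruction i completes
    return last

no_stall = finish_cycle(4, 4, {})        # 7 clock ticks
one_bubble = finish_cycle(4, 4, {1: 1})  # 8 clock ticks
```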
Input queue: instructions already fetched into the prefetch input queue do not reflect later
writes to instruction memory, and instruction caches make this phenomenon even worse. This is
only relevant to self-modifying programs.
Mathematical pipelines: mathematical or arithmetic pipelines are different from instruction
pipelines in that, when mathematically processing large arrays or vectors, a particular
mathematical operation, such as a multiply, is repeated many thousands of times. In this
environment, an instruction need only kick off an event whereby the arithmetic logic unit
(which is pipelined) takes over and begins its series of calculations. Most of these circuits can
be found today in math processors and the math processing sections of CPUs like the Intel
Pentium line.
History
Math processing (super-computing) began in earnest in the late 1970s with vector processors
and array processors: usually very large, bulky super-computing machines that needed special
environments and super-cooling of the cores. One of the early supercomputers was the Cyber
series built by Control Data Corporation. Its main architect was Seymour Cray, who later
resigned from CDC to head up Cray Research. Cray developed the X-MP line of supercomputers,
using pipelining for both multiply and add/subtract functions. Later, Star Technologies took
pipelining to another level by adding parallelism (several pipelined functions working in
parallel), developed by their engineer, Roger Chen. In 1984, Star Technologies made another
breakthrough with the pipelined divide circuit, developed by James Bradley. By the mid-1980s,
super-computing had taken off with offerings from many different companies around the world.
Today, most of these circuits can be found embedded inside most micro-processors.
R1 = [RSM];
DATA.MEMORY ADRS <-- [AR];
[DATA.MEMORY ADRS] <-- [STR];  // FOR STORE INSTRUCTION ONLY
[R1] <-- [DATA.MEMORY ADRS];   // FOR LOAD INSTRUCTION ONLY
[R1] <-- [AR];                 // FOR ARITHMETIC AND LOGIC INSTRUCTIONS ONLY