Phong Nguyen
1
MIPS CPU: Core Instruction Set Implementation
Purpose
This machine was built to demonstrate how the core instructions of the MIPS instruction set are
implemented in a multi‐cycle CPU. The data path was designed following the “black box” modular design
methodology to demonstrate how complex logic can be simplified into individual modules. In addition,
the CPU was built and simulated using Verilog in order to familiarize the student with modern synthesis
techniques.
By implementing the instruction set in a multi-cycle CPU, the designer is able to reduce the
average time it takes to execute an instruction. If the execution of an instruction is not split across
multiple clock cycles, the designer must ensure that the clock period is long enough for the slowest
instruction to complete. This results in large overhead whenever an instruction could execute in much
less time than the clock period allows. A multi-cycle design lets an instruction that needs less work
finish in fewer cycles, while an instruction that takes longer simply uses more cycles rather than forcing
a longer clock period on everything else. As a result of letting less complex instructions finish sooner,
the average time it takes to run a program is lower than in the single-cycle case, where every instruction
takes as long as the slowest instruction to execute. This multi-cycle approach demonstrates just one way
to improve CPU performance.
To simplify synthesis of the machine, a hardware description language was used and a modular
design methodology was adopted. A modular design methodology allows complex structures and logic,
such as the control unit, to be self-contained. Organizing the CPU into self-contained logic modules
enables the CPU to be represented as a data path diagram. This visual representation reduces
complexity and enables designers to make more intuitive design decisions. For example, by adopting
this visual representation the designer is able to follow a signal line as it enters and exits a logic module.
By understanding the different paths that each instruction takes, the designer can optimize those paths
by replacing or removing a logic module. Accessing a module costs clock cycles, so replacing a slow
module with a faster one, or reorganizing a path to eliminate an access entirely, directly improves CPU
performance. These insights are why the modular design methodology is so powerful.
Instruction Set Definition
In this implementation, fifteen instructions from the MIPS core instruction set were
implemented: a mix of R, I, and J type instructions. Particular attention was paid to the branch and
jump instructions so that programming control structures could be built. Memory access instructions,
such as load word and store word, were also implemented to aid programming tasks. The remaining
instructions are arithmetic or logical in nature and require ALU access. Refer to the table below for
details on how the instructions are encoded.
Instruction Format
R type (fields: OPCODE, RS, RT, RD, SHAMT, FUNCT; opcode and funct values in decimal):

TYPE  NAME  OPERATION                                      OPCODE  FUNCT
R     sll   R[rd] = R[rt] << SHAMT                         0       0
R     srl   R[rd] = R[rt] >> SHAMT                         0       2
R     add   R[rd] = R[rs] + R[rt]                          0       32
R     sub   R[rd] = R[rs] - R[rt]                          0       34
R     and   R[rd] = R[rs] & R[rt]                          0       36
R     or    R[rd] = R[rs] | R[rt]                          0       37
R     nor   R[rd] = ~(R[rs] | R[rt])                       0       39
R     slt   R[rd] = (R[rs] < R[rt]) ? 1 : 0                0       42

I type (fields: OPCODE, RS, RT, IMMEDIATE):

TYPE  NAME  OPERATION                                      OPCODE
I     beq   if (R[rs] == R[rt]) PC = PC + 1 + SignExtImm   4
I     addi  R[rt] = R[rs] + SignExtImm                     8
I     slti  R[rt] = (R[rs] < SignExtImm) ? 1 : 0           10
I     lui   R[rt] = {Imm, 16'b0}                           15
I     lw    R[rt] = M[R[rs] + SignExtImm]                  35
I     sw    M[R[rs] + SignExtImm] = R[rt]                  43

J type (fields: OPCODE, IMMEDIATE):

TYPE  NAME  OPERATION                                      OPCODE
J     j     PC = Imm                                       2
Architecture
A multi‐cycle MIPS data path was chosen as the target architecture for its uniformity and simplicity. The
three different instruction types outline a template that all instructions must follow. This allows the
designer to take advantage of shared control signals and greatly reduce the complexity of the control
logic. The implementation of the immediate arithmetic instructions is a case in point.
On the surface, immediate arithmetic instructions operate very much like R type instructions,
but they also contain features of I type instructions. This similarity enables us to reduce the control logic
by copying the memory address computation state, with one minor change. Instead of an ALUOp of 0,
which instructs the ALU to perform addition, an ALUOp of 3 is used. Much like how an ALUOp of 2
represents an R type arithmetic instruction, an ALUOp of 3 represents an immediate arithmetic instruction. By
reusing the ALUOp control signal we are able to reuse the ALU controller module, extending it with
minor modifications to support this new category of instructions. Since the immediate arithmetic
instructions do not contain a FUNCT field, an alternative tag must be used to indicate what operation
the ALU should perform. To support this, the OpCode field of the instruction is routed to the ALU
controller module as an additional input. This works because each immediate arithmetic instruction has
a unique OpCode, unlike R type instructions, which all have an OpCode of 0. R type instructions also
exhibit control signal sharing: instead of using the OpCode as the tag, all R type instructions have a
FUNCT field that is used to control the ALU. The uniformity of the MIPS instruction set allows such
optimizations to take place, which greatly reduces logic complexity. This optimization is used throughout
the design of the CPU.
Modules
The following figure is the data path implemented by this machine. It takes advantage of
modular design and module reuse optimizations to greatly reduce the complexity of the machine. Once
the organization of the data path is outlined the individual modules and their functionalities can be
described and synthesized using Verilog.
Memory
In the multi-cycle implementation the memory module is used for both data and instructions.
This eliminates one memory module relative to the single-cycle implementation, but requires more
complex control logic. This is a reasonable trade-off because combinational control logic is usually
cheaper to implement than a memory hierarchy.
To use one memory module for both data and instructions, a multiplexor is needed to determine
whether to read the next instruction or to read a piece of data from memory. In addition, two control
lines are needed to enable reading and writing. Closer inspection shows that this could be simplified
further by using the negation of either the read signal or the write signal, reducing the control lines
from two to one. That simplification is not made here, in order to keep the design easy to understand.
Another simplification is made to this particular memory module: since no byte-addressing
instructions are implemented, the memory is word addressable rather than byte addressable. This
removes the need to shift left by two when calculating branch addresses and jump addresses.
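The shared memory described above can be sketched in Verilog. This is an illustrative sketch rather than the project's exact Memory.v: the signal names (MemRead, MemWrite, IorD) follow this report's data path, but the memory depth, port widths, and module boundary are assumptions.

```verilog
// Illustrative sketch: one word-addressable memory shared between
// instruction fetch and data access. The IorD multiplexor (shown as a
// comment below) selects the address source; names are assumptions.
module SharedMemory (
    input         Clock,
    input         MemRead,     // read an instruction or a data word
    input         MemWrite,    // store a word
    input  [31:0] Address,     // already chosen by the IorD mux
    input  [31:0] WriteData,
    output [31:0] MemData
);
    reg [31:0] Memory [0:63];  // small word-addressable memory

    // reads are combinational, gated by MemRead
    assign MemData = MemRead ? Memory[Address] : 32'b0;

    // writes happen on the clock edge when MemWrite is asserted
    always @ (posedge Clock)
        if (MemWrite)
            Memory[Address] <= WriteData;
endmodule

// The IorD mux itself is a single line in the data path:
//   assign Address = IorD ? ALUOut : PC;
```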
Register File
The register file does not change very much from the single-cycle implementation. What does
change are its inputs and outputs. The write data input is connected to a multiplexor that takes a
register called Memory Data as one of its inputs; this register retains the value read from memory for
use in a later cycle. The two read data ports must likewise be latched so their values can be supplied
to the ALU in a later cycle. There must also be a write control signal to prevent invalid writes.
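A minimal sketch of the register file just described, assuming standard MIPS conventions (two combinational read ports, one clocked write port gated by RegWrite, register $0 hardwired to zero). The port names are illustrative, not necessarily the project's exact ones.

```verilog
// Illustrative register file sketch: RegWrite prevents invalid writes,
// and reads are combinational so both operands are available at once.
module RegisterFile (
    input         Clock,
    input         RegWrite,            // write enable
    input  [4:0]  ReadReg1, ReadReg2,  // RS and RT fields
    input  [4:0]  WriteReg,            // RD or RT, chosen by RegDst
    input  [31:0] WriteData,
    output [31:0] ReadData1, ReadData2
);
    reg [31:0] Registers [0:31];

    // register $0 always reads as zero (standard MIPS convention)
    assign ReadData1 = (ReadReg1 == 0) ? 32'b0 : Registers[ReadReg1];
    assign ReadData2 = (ReadReg2 == 0) ? 32'b0 : Registers[ReadReg2];

    always @ (posedge Clock)
        if (RegWrite)
            Registers[WriteReg] <= WriteData;
endmodule
```

In the multi-cycle data path, the two read values would then be latched into the A and B registers for use by the ALU in a later cycle.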
Instruction Register
The instruction register was created as its own module to enhance comprehension. Its behavior
could be defined in Verilog using simple reg constructs, but encapsulating it in its own module hides
the complexity of deriving the various instruction fields. The behavior is simple: when IRWrite is
asserted, the instruction register is written with the instruction from memory, and the module then
derives the various instruction fields and provides them as outputs for the rest of the data path to use.
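The behavior just described can be sketched as follows. The field positions follow the standard MIPS encoding shown in the instruction format table; the port names are assumptions for this example.

```verilog
// Illustrative instruction register sketch: latch the fetched word when
// IRWrite is asserted, then expose its fields as pure wiring.
module InstructionRegister (
    input         Clock,
    input         IRWrite,
    input  [31:0] MemData,       // instruction arriving from memory
    output [5:0]  OpCode,
    output [4:0]  RS, RT, RD, SHAMT,
    output [5:0]  FUNCT,
    output [15:0] Immediate
);
    reg [31:0] Instruction;

    always @ (posedge Clock)
        if (IRWrite)
            Instruction <= MemData;

    // field extraction requires no logic, only wiring
    assign OpCode    = Instruction[31:26];
    assign RS        = Instruction[25:21];
    assign RT        = Instruction[20:16];
    assign RD        = Instruction[15:11];
    assign SHAMT     = Instruction[10:6];
    assign FUNCT     = Instruction[5:0];
    assign Immediate = Instruction[15:0];
endmodule
```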
ALU
The operation of the ALU is entirely dependent upon the ALU control signal that comes from the
ALU controller module. The ALU control signal is 4 bits wide and tells the ALU what operation to
perform. In addition, this implementation of the ALU also accepts the SHAMT field as an input; this
field contains the amount to shift. Depending on the value of the FUNCT field, the shift is either a
logical left shift or a logical right shift. What makes this ALU notable is its ability to perform a variety
of operations, which reduces the number of ALUs from the single-cycle data path down to just one, greatly
reducing the cost. To perform both arithmetic and address computations, the inputs of the ALU are
chosen using a number of multiplexors, whose select signals are derived from the control module. This
enables the inputs to the ALU to come from different sources depending on the instruction.
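Such a multi-operation ALU might look like the sketch below. The 4-bit control encodings here are illustrative assumptions, not necessarily the project's actual values; the operation list matches the instructions implemented in this machine.

```verilog
// Illustrative ALU sketch covering the operations this machine needs.
// The 4-bit ALUControl encodings are assumed values for this example.
module ALU (
    input      [3:0]  ALUControl,
    input      [4:0]  SHAMT,      // shift amount for sll/srl
    input      [31:0] A, B,
    output reg [31:0] Result,
    output            Zero        // used by beq to gate PCWriteCond
);
    assign Zero = (Result == 32'b0);

    always @ (*) begin
        case (ALUControl)
            4'b0010: Result = A + B;                 // add / address computation
            4'b0110: Result = A - B;                 // sub / beq comparison
            4'b0000: Result = A & B;                 // and
            4'b0001: Result = A | B;                 // or
            4'b1100: Result = ~(A | B);              // nor
            4'b0111: Result = ($signed(A) < $signed(B)) ? 32'd1 : 32'd0; // slt
            4'b1000: Result = B << SHAMT;            // sll
            4'b1001: Result = B >> SHAMT;            // srl
            default: Result = 32'b0;
        endcase
    end
endmodule
```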
ALU Control
This ALU control module is slightly different, as mentioned earlier: it accepts the OpCode field of
the instruction as an additional input. This allows it to decode immediate arithmetic instructions and
produce the appropriate ALU control output signals. How the control module interprets its inputs
depends upon the ALUOp code; four codes are implemented in this machine. An ALUOp of 0 tells the
ALU control to make the ALU perform addition. An ALUOp of 1 tells the ALU control to make the ALU
perform subtraction. An ALUOp of 2 tells the control module to use the FUNCT field to generate the
control outputs. Finally, an ALUOp of 3 tells the ALU control to use the OpCode to derive the
appropriate ALU control signals.
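The four ALUOp codes can be sketched as a nested case statement. The FUNCT and OpCode values come from the instruction format table; the 4-bit output encodings are illustrative assumptions for this example, and lui (opcode 15) is omitted for brevity.

```verilog
// Illustrative ALU controller sketch implementing the four ALUOp codes
// described above. Output encodings are assumed example values.
module ALUController (
    input      [1:0] ALUOp,
    input      [5:0] FUNCT,
    input      [5:0] OpCode,
    output reg [3:0] ALUControl
);
    always @ (*) begin
        case (ALUOp)
            2'b00: ALUControl = 4'b0010;              // force addition
            2'b01: ALUControl = 4'b0110;              // force subtraction (beq)
            2'b10: case (FUNCT)                       // R type: decode FUNCT
                       6'd0:  ALUControl = 4'b1000;   // sll
                       6'd2:  ALUControl = 4'b1001;   // srl
                       6'd32: ALUControl = 4'b0010;   // add
                       6'd34: ALUControl = 4'b0110;   // sub
                       6'd36: ALUControl = 4'b0000;   // and
                       6'd37: ALUControl = 4'b0001;   // or
                       6'd39: ALUControl = 4'b1100;   // nor
                       6'd42: ALUControl = 4'b0111;   // slt
                       default: ALUControl = 4'b0000;
                   endcase
            2'b11: case (OpCode)                      // immediate: decode OpCode
                       6'd8:  ALUControl = 4'b0010;   // addi
                       6'd10: ALUControl = 4'b0111;   // slti
                       default: ALUControl = 4'b0000;
                   endcase
        endcase
    end
endmodule
```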
Control
The control is the most complex module to be synthesized, and it is also the most crucial to the
proper functioning of the CPU. The module is essentially a state machine that first fetches the
instruction and decodes its OpCode, then proceeds through the proper states depending on that
OpCode. During each state the appropriate control signals are set for the current instruction. The
control also decides which module to access during the current clock cycle. As mentioned earlier, the
complexity of the control logic is already greatly reduced because the MIPS architecture was adopted.
Implementing new instructions is also greatly simplified by adopting the MIPS architecture. By
identifying the format of the new instruction, many of the states in the control can be reused with only
slight modification. An example is the ComputeImm state, a variation of the ComputeAddr state
designed specifically for the immediate arithmetic instructions by setting the
ALUOp to 3. Another example is the ImmCompletion state, a variation of the RTYPECompletion state,
except that RegDst is set to 0 to select RT as the write address rather than RD.
Testing
A Fibonacci term calculator was written using the MIPS instructions implemented by the
machine. This program could not test every instruction, but it provided a good start. Given N, the
program computes the (N-1)th term of the Fibonacci sequence. First, the initial values are loaded into
registers from memory. Memory addresses starting at 0 are reserved for program instructions, and
addresses from 50 onward are reserved for program data. Once calculated, the result is stored into the
appropriate space in memory; for the Fibonacci calculator it is stored at memory address 53. To
accomplish this behavior, the program must thoroughly exercise the memory access instructions and
the branch and jump instructions. It must also test a few of the arithmetic instructions; a mix of
immediate arithmetic and R type arithmetic instructions was used to broaden coverage.
The next program tests the rest of the instructions, the majority of which are R type arithmetic
and immediate arithmetic instructions. These tests are simple: they load initial signed decimal values,
perform the arithmetic operation on them, and then store the result back into the data memory space.
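As a concrete illustration, two words from the machine's memory initialization (reproduced from the Memory.v excerpt at the end of this report) encode a store word and a set-less-than according to the instruction format table:

```verilog
// sw $3, 56($0) : M[R[0] + 56] = R[3]
// fields: opcode=43, rs=0, rt=3, immediate=56
Memory[19] = {6'd43, 5'd0, 5'd3, 16'd56};

// slt $3, $1, $2 : R[3] = (R[1] < R[2]) ? 1 : 0
// fields: opcode=0, rs=1, rt=2, rd=3, shamt=0, funct=42
Memory[20] = {6'd0, 5'd1, 5'd2, 5'd3, 5'd0, 6'd42};
```

Note that the sw offset of 56 is a word address, since this machine's memory is word addressable.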
Conclusion
The MIPS instruction set is designed to be very easy to implement in hardware. Everything from
the encoding of the instructions to the design of the data path takes advantage of spatial and temporal
locality. The uniformity allowed signals and modules to be shared, which reduced the amount of logic
needed to execute an instruction. Splitting instruction execution into multiple clock cycles takes
advantage of temporal locality by allowing faster instructions to complete in fewer cycles. The MIPS instruction set
is simple to understand, yet it demonstrates the basic concepts of CPU design well. Presented here was
just a subset of the MIPS instruction set; further exploration would yield additional design techniques
for optimizing the CPU. Pipelining and floating point instructions would be worth exploring, but they
are beyond the scope of this implementation.
Lessons Learned
Tools for debugging Verilog are very archaic. Compared to a modern IDE such as Eclipse,
ModelSim's Verilog debugger leaves much to be desired. As a result, it took hours to debug a simple
mistake buried in layers of code. The design concepts were not hard to understand, but expressing
them in Verilog was frustrating, and this was the most time-consuming part of the project.
There were design lessons to be learned as well. Through in-depth examination of the different
aspects of the MIPS instruction set, it is obvious that spatial and temporal locality are taken full
advantage of. Also, the principle of making the common case fast is applied frequently. It would not be
surprising if further inspection of modern processors yielded similar optimizations based upon those
fundamental concepts.
D:/Classes/CPR E 305/FinalProject/Control.v (excerpt; printed by user Bao Nguyen, December 04, 2007)

reg [11:0] State;

// sequential logic
always @ (posedge Clock) begin
  case (State)
    InstrFetch: State <= InstrDecode;
    InstrDecode: begin
      case (OpCode)
        RTYPE: State <= Execution;
        LW:    State <= ComputeAddr;
        LUI:   State <= ComputeImm;
        SW:    State <= ComputeAddr;
        BEQ:   State <= BranchCompletion;
        ADDI:  State <= ComputeAddr;
        SLTI:  State <= ComputeImm;
        J:     State <= JumpCompletion;
      endcase
    end
    ComputeAddr: begin
      case (OpCode)
        LW:   State <= MemReadAccess;
        SW:   State <= MemWriteAccess;
        ADDI: State <= ImmCompletion;
      endcase
    end
    ComputeImm:       State <= ImmCompletion;
    MemReadAccess:    State <= WriteBack;
    MemWriteAccess:   State <= InstrFetch;
    WriteBack:        State <= InstrFetch;
    Execution:        State <= RTYPECompletion;
    ImmCompletion:    State <= InstrFetch;
    RTYPECompletion:  State <= InstrFetch;
    BranchCompletion: State <= InstrFetch;
    JumpCompletion:   State <= InstrFetch;
    default:          State <= InstrFetch;
  endcase
end

// combinational logic
always @ (State) begin
  // we want everything to be zero if it is not explicitly set in each state
  // ...
    ComputeImm: begin
      // muxes
      ALUSrcA = 1;
      ALUSrcB = 2'b10;
      ALUOp = 2'b11;
    end
    MemReadAccess: begin
      // control signals
      MemRead = 1;
      IorD = 1;
    end
    MemWriteAccess: begin
      // control signals
      MemWrite = 1;
      IorD = 1;
    end
    WriteBack: begin
      // control signals
      RegWrite = 1;
      // muxes
      RegDst = 0;
      MemToReg = 1;
    end
    ImmCompletion: begin
      // control signals
      RegWrite = 1;
      // muxes
      RegDst = 0;
      MemToReg = 0;
    end
    Execution: begin
      // muxes
      ALUSrcA = 1;
      ALUSrcB = 2'b00;
      ALUOp = 2'b10;
    end
    RTYPECompletion: begin
      // control signals
      RegWrite = 1;
      // muxes
      RegDst = 1;
      MemToReg = 0;
    end
    BranchCompletion: begin
      // control signals
      PCWriteCond = 1;
      // muxes
      ALUSrcA = 1;
      ALUSrcB = 2'b00;
      ALUOp = 2'b01;
      PCSource = 2'b01;
    end
    JumpCompletion: begin
      // control signals
      PCWrite = 1;
      // muxes
      PCSource = 2'b10;
    end
    default: begin
      // control signals
      MemRead = 1;
      IRWrite = 1;
      PCWrite = 1;
      // muxes
      ALUSrcA = 0;
      IorD = 0;
      ALUSrcB = 2'b01;
      ALUOp = 2'b00;
      PCSource = 2'b00;
    end
  endcase
end
endmodule
D:/Classes/CPR E 305/FinalProject/Memory.v (excerpt)

  Memory[19] = {6'd43, 5'd0, 5'd3, 16'd56};
  Memory[20] = {6'd0, 5'd1, 5'd2, 5'd3, 5'd0, 6'd42};
  Memory[21] = {6'd43, 5'd0, 5'd3, 16'd57};
  Memory[22] = {6'd0, 5'd2, 5'd1, 5'd3, 5'd0, 6'd42};
  Memory[23] = {6'd43, 5'd0, 5'd3, 16'd58};
  Memory[24] = {6'd10, 5'd1, 5'd3, 16'd37};
  Memory[25] = {6'd43, 5'd0, 5'd3, 16'd59};
  Memory[26] = {6'd10, 5'd1, 5'd3, 16'd64};
  Memory[27] = {6'd43, 5'd0, 5'd3, 16'd60};
  Memory[50] = 32'd1;  // the constant 1 to load into $1 (f1)
  Memory[51] = -32'd1; // the constant -1 to load into $2 (f2)
  Memory[52] = 32'd8;  // the (n-1)th Fibonacci number to calculate goes into $3
end

// read from memory
assign MemData = MemRead ? Memory[Address] : 0;

// The PC is normally incremented by 4 bytes, which is the next word,
// because the system is byte addressable. We ignore this, which
// simplifies the design by not requiring the immediate value to be
// shifted left by 2 on a branch.

// write to memory
always @ (posedge Clock) begin
  if (MemWrite) begin
    Memory[Address] <= WriteData;
  end
end
endmodule
Memory Output.txt (12/4/2007)

// memory data file (do not edit the following line - required for mem load use)