Microelectronic System Design Final Project Report Designing a custom DLX processor with Very Long Instruction Word support Vittorio Giovara 149374 04/08/2008 http://gle-mips.googlecode.com

Designing a custom DLX processor with Very Long Instruction Word support

Apr 11, 2015


Documentation report of a completely new processor, implementing a non pipelined DLX architecture; moreover it supports the Very Long Instruction Word structure, meaning that two parallel instructions can be executed. A full physical design has been carried out.
Page 1: Designing a custom DLX processor with Very Long Instruction Word support

Microelectronic System Design

Final Project Report

Designing a custom DLX processor with Very Long Instruction Word support

Vittorio Giovara 149374

04/08/2008

http://gle-mips.googlecode.com


Contents

1 Overview

2 Processor Design
  2.1 General Structure
    2.1.1 Control Unit
    2.1.2 Arithmetic Logic Unit
  2.2 Module Interconnection
    2.2.1 PC / NPC conflict
  2.3 VLIW Support

3 Physical Design
  3.1 Synthesis Configuration
  3.2 Results Report
  3.3 On Silicon

4 Optimizations
  4.1 Going back to RTL level
  4.2 Synthesis Reconfiguration
  4.3 Final Results
    4.3.1 Final Conclusions

Bibliography

A Simulation Waves


List of Tables

3.1 Non optimized synthesis constraints
3.2 Non optimized synthesis timing results
3.3 Non optimized synthesis power results
3.4 Non optimized synthesis area results
3.5 Non optimized synthesis silicon results

4.1 Optimized synthesis constraints
4.2 Optimized synthesis timing results
4.3 Optimized synthesis power results
4.4 Optimized synthesis area results
4.5 Optimized synthesis silicon results


List of Figures

1.1 Optimized DLX processor on silicon, simulating IR drop measurements

2.1 DLX processor structural view

3.1 DLX processor on silicon after place and route
3.2 DLX processor IR drop rail analysis

4.1 Optimized DLX processor silicon after place and route

A.1 DLX simulation waves


Chapter 1

Overview

THIS DOCUMENT WILL focus on the implementation details of a custom DLX processor with Very Long Instruction Word support and a non-pipelined data path. An architectural description of the processor will be provided, outlining the functionality of the main modules and the general structure of the design. The design will then be carried through to the physical level, providing detailed information about system-on-chip simulation, power consumption and gate characterization.

Industry-standard tools, namely Design Vision from Synopsys and First Encounter from Cadence, are used in the process.

Figure 1.1: Optimized DLX processor on silicon, simulating IRdrop measurements


Chapter 2

Processor Design

IN THIS CHAPTER a detailed explanation of the processor design will be provided; the two most important modules, the Control Unit and the Arithmetic Logic Unit, will be carefully analyzed, taking into consideration the distributed decoding approach. Furthermore, an overview of the Very Long Instruction Word support will be given, with a close look at its implementation details.

2.1 General Structure

The processor is built using the classical DLX implementation, with a simple non-pipelined Control Unit and distributed decoding of the instruction. Every instruction requires exactly five clock cycles for complete execution.

At any time the processor can activate a second data path, exploiting the Very Long Instruction Word architecture: two instructions can be executed in parallel, with no data dependency problems, as these have already been resolved by the compiler.

Figure 2.1 shows the schematic view of the processor, describing the double data path and the 4R2W¹ register file and memory.

2.1.1 Control Unit

The control unit of the processor uses a microcode control memory with instruction relocation. Every I-type and J-type instruction activates different control words according to its opcode. R-type² instructions, which share a single opcode (all zeroes), are relocated in a similar way, but using the func field (11 bits); this field is shifted left by two so that it does not conflict with the opcodes of I-type and J-type instructions.

The microcode is composed of four lines per instruction, plus two lines for reset (all zeroes) and instruction fetch, which are the same for every instruction. In order to access the microcode, a relocation vector is used, mapping every opcode or func to the corresponding control word.

Finally, the control unit provides the correct operation selection for the Arithmetic Logic Unit, again using the opcode or the func of the instruction.
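The microcode layout described above can be sketched in a few lines of Python. This is only an illustrative model, not the project's VHDL: the opcode and func values are hypothetical, and instead of the shift-left-by-two trick the sketch uses tagged keys to keep opcodes and func values apart. It assumes that one microcode line is consumed per clock cycle.

```python
# Hypothetical model of the microcode layout; opcode/func values are
# illustrative and are not taken from the project sources.
FETCH_ROW = 1               # row 0 is reset (all zeroes), row 1 is fetch
COMMON_ROWS = 2             # rows shared by every instruction
ROWS_PER_INSTRUCTION = 4    # private rows of each instruction


def build_relocation(keys):
    """Map each opcode/func key to the first of its four microcode rows."""
    return {key: COMMON_ROWS + i * ROWS_PER_INSTRUCTION
            for i, key in enumerate(keys)}


def relocation_key(opcode, func):
    """I/J-type instructions are keyed by opcode, R-type (opcode 0) by func."""
    return ("op", opcode) if opcode != 0 else ("func", func)


def microcode_rows(relocation, opcode, func=0):
    """Rows visited for one instruction: the fetch row, then four rows."""
    start = relocation[relocation_key(opcode, func)]
    return [FETCH_ROW] + [start + i for i in range(ROWS_PER_INSTRUCTION)]


relocation = build_relocation([("op", 0x08), ("op", 0x23), ("func", 0x20)])
rows = microcode_rows(relocation, opcode=0x00, func=0x20)
print(rows, len(rows))
```

Under these assumptions every instruction visits five rows (one fetch row plus four private rows), which is consistent with the five clock cycles per instruction stated in section 2.1.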

Figure 2.1: DLX processor structural view

¹ This acronym means 4 read ports and 2 write ports.
² I-type, J-type and R-type are, respectively, register-to-immediate, jump and branch, and register-to-register instructions.

2.1.2 Arithmetic Logic Unit

The Arithmetic Logic Unit is capable of

• Signed/Unsigned Addition

• Signed/Unsigned Subtraction

• Logical Operation (and, or, xor)

• Comparisons

The addition/subtraction module uses a sparse tree implementation, allowing very fast carry propagation and result computation; the logical module uses a design similar to the one found in the SPARC T2 processor, with four selection signals; the comparison module is implemented behaviorally, using the result of the subtraction of the two inputs.

The input operands from the two multiplexers are connected to all the modules above, and each of them provides a result; only one result is selected, by the alu_opcode control signal provided directly by the Control Unit.
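The "compute everything, then select" organization can be illustrated with a small behavioral model. This is a sketch under stated assumptions: the 32-bit width and the operation names (`add`, `sub`, `seq`, `sge` and so on) are chosen for illustration and are not the project's actual alu_opcode encoding.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit data path


def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def alu(op, a, b):
    """Every unit computes its result; the opcode then selects one of them."""
    a &= MASK
    b &= MASK
    diff = (a - b) & MASK                       # shared subtraction result
    signed_diff = to_signed(a) - to_signed(b)   # overflow-free signed difference
    results = {
        "add": (a + b) & MASK,
        "sub": diff,
        "and": a & b,
        "or": a | b,
        "xor": a ^ b,
        # Comparisons are derived from the subtraction of the two inputs.
        "seq": int(diff == 0),
        "sne": int(diff != 0),
        "sge": int(signed_diff >= 0),
        "slt": int(signed_diff < 0),
    }
    return results[op]                          # models the output multiplexer


print(alu("add", 13, 15), alu("sub", 15, 13), alu("sge", 43, 4))
```

In hardware all units work in parallel and the multiplexer only picks a wire; the dictionary lookup plays that role here.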

2.2 Module Interconnection

In order to reflect general DLX implementations, the Control Unit doesn’t provide all the control signals needed during the various stages, but leaves most of the instruction decoding to be done during execution.

In fact, the registers (Rs1, Rs2 and Rd) are selected according to five different categories by the ARG_PROC process:

1. For R-type instructions, bits from 25 down to 11 select the three registers (Rs1, Rs2 and Rd) respectively;

2. For unconditional jumps, the first register is always set to 0 in order to force jump execution, while the second register is used for register operations (like jalr or jr), and the destination register is forced to 31 for jal and jalr operations (this doesn’t cause problems with the other jump instructions, because the register file write enable is activated only for these two instructions);

3. For conditional jumps, the first register is needed for comparison and branch evaluation;

4. For store instructions, a particular configuration is needed, since the second register holds the content to be saved in memory;

5. For all the other instructions, the first register is the first operand, while the second register corresponds to the destination.

While providing the registers, this process also performs sign extension of the immediate value and saves the result in the corresponding immediate register; immediates are usually 16 bits wide, but the field is widened to 26 bits for unconditional jumps.
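As an illustration of the sign extension step, the following sketch extends a 16-bit or 26-bit immediate to 32 bits. The function names and the 32-bit target width are assumptions made for the example, and the instruction word used in the usage line carries a made-up opcode.

```python
def sign_extend(value, bits, width=32):
    """Sign-extend the low `bits` bits of `value` to a `width`-bit pattern."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):      # sign bit set: value is negative
        value -= 1 << bits
    return value & ((1 << width) - 1)


def immediate(instr, unconditional_jump):
    """Pick the 26-bit field for unconditional jumps, 16 bits otherwise."""
    bits = 26 if unconditional_jump else 16
    return sign_extend(instr, bits)


# Hypothetical jump word whose low 26 bits hold the offset 32.
print(hex(immediate(0x0C000020, unconditional_jump=True)))
print(hex(sign_extend(0xFFFC, 16)))    # 16-bit encoding of -4
```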

Another distributed control is the size of the word to be saved in memory or loaded from it: a simple behavioral multiplexer selects the control signals according to the opcode of the instruction.

The saving of results in the register file also has distributed control, with a similar approach: there is a multiplexer that by default selects the data arriving from the Write Back stage, but for jal and jalr the correct value of the Program Counter is chosen (the destination register has already been selected by the ARG_PROC process) and sent to the register file.

2.2.1 PC / NPC conflict

The possible loss of a clock strobe due to the presence of two sequential modules (the Program Counter and the New Program Counter) in the same stage was initially resolved using a register sensitive to the falling edge of the clock.

However, this element caused a frequency reduction, as data must be ready in a shorter period of time, even though the critical path was not involved. As explained later in section 4.1, this module has been replaced with a simple latch.

2.3 VLIW Support

The Very Long Instruction Word support can be enabled or disabled through the vliw_en control signal: the instruction memory is designed to provide a 64-bit instruction, which is split and sent to two different data paths.

The LSB part of the long instruction is always executed, and any instruction can be placed there; the MSB part can contain only a reduced set of instructions, or no instruction at all. No instruction is provided when vliw_en is set to zero, and no data is written to the register file or memory module, so there is no risk of data corruption; however, in order to keep the structure simple, the second data path cannot execute jumps or branches.

The register file and the memory module have been carefully redesigned with two write ports and four read ports, and with an additional input for the vliw_en signal that disconnects the additional write port when VLIW support is turned off.
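The splitting of the long instruction and the gating of the second write port can be modeled as follows. This is a minimal sketch: the function names and the representation of a write as a `(register, value)` pair are assumptions made for the example.

```python
def split_long_instruction(word64, vliw_en):
    """Split a 64-bit instruction word into the two 32-bit data path inputs."""
    primary = word64 & 0xFFFFFFFF                  # LSB half, always executed
    secondary = (word64 >> 32) & 0xFFFFFFFF if vliw_en else None
    return primary, secondary


def active_writes(primary_write, secondary_write, vliw_en):
    """Return the register writes that reach the register file.

    The second write port is disconnected when VLIW support is off.
    """
    writes = [primary_write]
    if vliw_en and secondary_write is not None:
        writes.append(secondary_write)
    return writes


print(split_long_instruction(0x1111111122222222, vliw_en=True))
print(active_writes((5, 13), (7, 15), vliw_en=False))
```

With `vliw_en` cleared, the MSB half is never issued and its write never reaches the register file, which is the property the text relies on to rule out data corruption.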


Chapter 3

Physical Design

AFTER A DETAILED EXPLANATION of the processor structure, this chapter will proceed to the physical design level, reporting the synthesis configuration and the corresponding results for the non optimized design.

3.1 Synthesis Configuration

The first, non optimized synthesis pass was run with relaxed constraints and normal compilation effort; the values used are listed in the following table.

Type                 Value
Clock Period         40 ns
Power Consumption    550 µW
Area Size            unconstrained

Table 3.1: Non optimized synthesis constraints

Every module of the processor has been analyzed, elaborated and compiled by Synopsys with no errors; a minimal amount of RAM, only two lines, has also been synthesized, because otherwise the missing module causes problems during silicon placement in Encounter.

The commands for applying the above constraints are:

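# 40 ns clock period on the "clock" port
create_clock -name "CLK" -period 40 clock
# Limit input-to-output delay to 40 ns
set_max_delay 40 -from [all_inputs] -to [all_outputs]
# Dynamic power target
set_max_dynamic_power 550 uW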

3.2 Results Report

Synopsys managed to meet the given constraints for both timing and power; table 3.2 shows that the slack is positive, and thus there are no potential timing violations.

The power constraint has also been met, as the Total Dynamic Power obtained in table 3.3 is lower than the specified value.


Type                   Value
Data Required Time     39,90 ns
Data Arrival Time      -24,46 ns
Slack                  15,44 ns

Table 3.2: Non optimized synthesis timing results

Type                   Value
Cell Internal Power    496,97 µW
Net Switching Power    34,28 µW
Total Dynamic Power    531,25 µW

Table 3.3: Non optimized synthesis power results

Type                     Value
Combinational area       57983,38 mm²
Noncombinational area    25153,22 mm²
Total area               83136,61 mm²

Table 3.4: Non optimized synthesis area results
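As a quick consistency check on tables 3.2 to 3.4, the short sketch below recomputes the derived rows from their components. The values are transcribed from the tables with decimal points; the variable names are chosen for illustration only.

```python
from decimal import Decimal as D

# Table 3.2: slack = data required time - data arrival time
required, arrival = D("39.90"), D("24.46")
slack = required - arrival

# Table 3.3: total dynamic power = cell internal + net switching (µW)
dynamic_power = D("496.97") + D("34.28")
power_constraint = D("550")

# Table 3.4: total area = combinational + noncombinational
total_area = D("57983.38") + D("25153.22")

print("slack:", slack, "ns, positive:", slack > 0)
print("dynamic power:", dynamic_power, "uW, within constraint:",
      dynamic_power < power_constraint)
print("total area:", total_area)
```

The recomputed area differs from the reported total by 0,01, which is compatible with rounding of the individual entries in the tool report.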

3.3 On Silicon

The design was then carried on to silicon placement and routing, in order to extract more detailed information about IR drop, electromigration and delay.

Figure 3.1: DLX processor on silicon after place and route


As can be seen in figure 3.1, there are four routing violations, marked by white crosses. However, they did not affect the delay computation, which showed no maximum delay violation; the maximum delay corresponds to the previously selected clock period (40 ns).

As for the power results, Encounter reported the following data¹:

Type                           Value
Average Power                  1,305 mW
Average Leakage Power          0,083 mW
Worst IR drop                  12,1 mV
Worst Electromigration (M1)    29,539 µA

Table 3.5: Non optimized synthesis silicon results

Figure 3.2: DLX processor IR drop rail analysis

¹ More exhaustive data is reported in the project data files.


Chapter 4

Optimizations

IN THIS FINAL CHAPTER a complete optimization analysis will be performed, restarting from the RTL level and working back down to the silicon representation.

4.1 Going back to RTL level

In order to achieve higher frequencies, the design was restarted from the RTL level, at which the single modules are implemented.

One module in particular previously caused a performance loss: the New Program Counter falling-edge register. Even though the critical path is not much affected by it, data from the PC had to be delivered within half a clock period, causing a general slowdown.

This module has been replaced with a simple latch, with rising edge sensitivity, so the PC / NPC conflict due to the presence of two sequential modules in the same stage is still avoided.

4.2 Synthesis Reconfiguration

The optimization targets are to increase the clock frequency and reduce power consumption, neglecting area constraints; this time a higher computation effort was used in both the synthesis process and the silicon placement and routing. The constraints used are

Type                 Value
Clock Period         30 ns
Power Consumption    250 µW
Area Size            unconstrained

Table 4.1: Optimized synthesis constraints

As for the power reduction, since this processor has a peculiar structure, with a Very Long Instruction Word path that can be disconnected at any time, a very effective power reduction technique is clock gating. Because the second data path can be unused, clock gating deactivates the clock strobes of unused modules until actual input data arrives.
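The principle can be shown with a small behavioral model of a latch-based clock gate for rising-edge registers: the enable is captured by a latch that is transparent while the clock is low, and the gated clock is the AND of the clock and the latched enable. This is a generic textbook arrangement written for illustration, with made-up sample vectors; it is not extracted from the synthesized netlist.

```python
def gated_clock(clk_samples, enable_samples):
    """Latch-based clock gating for positive-edge registers.

    The enable latch is transparent while the clock is low, so changes of
    the enable during the high phase cannot create glitches on the output.
    """
    latched = 0
    out = []
    for clk, en in zip(clk_samples, enable_samples):
        if clk == 0:              # latch transparent on the low phase
            latched = en
        out.append(clk & latched)  # AND gate producing the gated clock
    return out


clk    = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [0, 0, 1, 1, 0, 1, 0, 0]
print(gated_clock(clk, enable))
```

Only the clock pulse that follows a low phase with the enable asserted reaches the registers; every other pulse is suppressed, so the gated modules do not toggle.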

So the configuration script used in Synopsys is

# 30 ns clock period on the "clock" port
create_clock -name "CLK" -period 30 clock
# Limit input-to-output delay to 30 ns
set_max_delay 30 -from [all_inputs] -to [all_outputs]
# Latch-based clock gating cells
set_clock_gating_style -sequential_cell latch \
    -positive_edge_logic {and} -negative_edge_logic {or}
# Dynamic power target
set_max_dynamic_power 250 uW
propagate_constraints -gate_clock
# High-effort mapping with clock gating enabled
compile -exact_map -gate_clock -map_effort high -power_effort high

4.3 Final Results

After synthesis the processor was placed on silicon with Encounter; no violations were detected.

Figure 4.1: Optimized DLX processor silicon after place and route

All the synthesis constraints have been met; in fact the new configuration reports

Type                   Value
Data Required Time     30 ns
Data Arrival Time      -3,78 ns
Slack                  26,22 ns

Table 4.2: Optimized synthesis timing results


Type                   Value
Cell Internal Power    196,54 µW
Net Switching Power    46,73 µW
Total Dynamic Power    243,27 µW

Table 4.3: Optimized synthesis power results

Type                     Value
Combinational area       63230,93 mm²
Noncombinational area    26457,17 mm²
Total area               89688,1 mm²

Table 4.4: Optimized synthesis area results

Table 4.2 shows that the timing of the data path has been reduced to the selected value, and that the slack is positive, so no timing violations are present. The power consumption in table 4.3 is greatly reduced thanks to clock gating; as expected, this technique increased the Combinational Area and the Net Switching Power while reducing the Cell Internal Power. The total area has increased only slightly (table 4.4).

After the silicon placement and routing, the values extracted from Encounter are¹:

Type                           Value
Average Power                  1,911 mW
Average Leakage Power          0,101 mW
Worst IR drop                  15,09 mV
Worst Electromigration (M1)    0,158 mA

Table 4.5: Optimized synthesis silicon results

4.3.1 Final Conclusions

From the synthesis point of view the optimization was quite successful: with an area increase of only about 8%, the clock period has been reduced by 25% (a frequency increase of about 33%) and the power consumption has been reduced by 54%. However, from the silicon placement point of view it is interesting to notice that the extracted values increased: this is normal because, most likely, in order to satisfy the optimized timing and power constraints, Encounter must have used very high speed gates, which require more current and dissipate more leakage power.
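The percentages can be recomputed from the synthesis tables (3.1 to 3.4 and 4.1 to 4.4). The sketch below does so with a hypothetical helper, `percent_change`; note that a clock period going from 40 ns to 30 ns is a 25% shorter period, which corresponds to a frequency roughly 33% higher.

```python
def percent_change(old, new):
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100


area = percent_change(83136.61, 89688.10)      # total area
period = percent_change(40, 30)                # clock period in ns
frequency = percent_change(1 / 40, 1 / 30)     # clock frequency
power = percent_change(531.25, 243.27)         # total dynamic power in µW

print(f"area      {area:+.1f}%")
print(f"period    {period:+.1f}%")
print(f"frequency {frequency:+.1f}%")
print(f"power     {power:+.1f}%")
```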

¹ More details are available in the project data files.


Bibliography

[1] John L. Hennessy, David A. Patterson, Computer Architecture: A Quantitative Approach

[2] Frank Emnett, Mark Beigel, Power Reduction through RTL clock gating

[3] Wikipedia, the free encyclopedia, DLX, http://en.wikipedia.org/wiki/DLX

[4] Wikipedia, the free encyclopedia, Very long instruction word, http://en.wikipedia.org/wiki/Very_long_instruction_word


Appendix A

Simulation Waves

The assembly program used to test the processor functionality is provided here, together with the resulting waves; the testbench deactivates VLIW support after ten clock cycles, and figure A.1 shows that execution continues with single instructions.

addi r5,r2,#13
addui r7,r2,#15
sub r11,r7,r5
sw 10(r0),r11
addu r14,r5,r7
slli r1,r11,#1
jal #32
nop
nop
lw r2,10(r0)
addu r14,r11,r7
and r22,r2,r2
sge r20,r14,r1
bnez r31, #-4
nop
nop
nop
nop


[Waveform plot of the /tb_glx/cpu signals, time axis from 0 ps to 2000000 ps. Entity: tb_glx, Architecture: test, Date: Thu Jul 31 04:08:40 PM CEST 2008. Value sequences legible in the plot:]

result               0 13 0 2 0 28 0 4 0
result_vliw          0 15 0
aluout               0 13 2 28 4
aluout_vliw          0 15 10 0
newpc                0 8 16 20 24 28
i_alu_opcode         adds subs addu llsh adds
i_alu_opcode_vliw    addu adds nop

[Traces shown without legible values: clock, reset, vliw_en, i_ir_latch_enable, i_npc_latch_enable, i_ir_latch_enable_vliw, i_rega_latch_enable, i_regb_latch_enable, i_regimm_latch_enable, i_rega_latch_enable_vliw, i_regb_latch_enable_vliw, i_regimm_latch_enable_vliw, i_eq_condition, i_jump_enable, i_muxa_selection, i_muxb_selection, i_alu_outreg_enable, i_muxb_selection_vliw, i_alu_outreg_enable_vliw, i_dram_wenable, i_lmd_latch_enable, i_pc_latch_enable]

Figure A.1: DLX simulation waves