nios 2

This thesis proposes a design of the Fast Fourier Transform and implements it on the Nios II embedded processor. This chapter covers the motivation, problem statement, project objectives, scope of work, project contributions, and thesis organization. In the last few decades, embedded systems have grown rapidly in both computing power and range of possible applications. The design procedure for embedded systems has also changed immensely. As application demands increase over time, the complexity of embedded systems grows. Combining software and hardware in a design improves system performance; this approach is known as co-design.

1.1 CO-DESIGN

Hardware/software co-design is the main technique used in the thesis. It can be defined as the

cooperative design of hardware and software. Co-design methodology deals with the problem of

designing heterogeneous systems. One of the goals of co-design is to shorten the time-to-market

while reducing the design effort and costs of the designed products. Co-design can be implemented on embedded systems, and the processor is the main part of any embedded system. The advantages of using processors are manifold, because software is more flexible and cheaper than hardware. This flexibility of software allows late design changes and simplified debugging. Furthermore, the possibility of reusing software by porting it to other processors reduces the time-to-market and the design effort. Finally, in most cases the use of processors is very cheap compared to the development costs of ASICs, because processors are often produced in high volume, leading to a significant price reduction. However, designers turn to hardware when processors cannot meet the required performance. This trade-off between hardware and software illustrates the optimization aspect of the co-design problem. Co-design is an interdisciplinary activity that brings together concepts and ideas from different disciplines, e.g. system-level modeling, hardware design and software design.

The design flow of the general co-design approach is depicted in Figure 1.

Step 1: The co-design process starts with specifying the system behavior at the system level.

Step 2: A pure software system is then developed to verify all algorithms.

Step 3: Performance analysis is performed to find the system bottlenecks.

Step 4: In the hardware/software partitioning phase, a plan is made to determine which parts will be realized in hardware and which in software. Typically, some system bottlenecks are moved to hardware to improve performance.

Step 5: Based on the results of Step 4, the hardware and software parts are designed.

Step 6: Co-simulation. The completed hardware and software parts are integrated and performance analysis is performed.

Step 7: If the performance meets the requirements, the design can stop. If it does not, a new HW/SW partitioning and a new design are required.
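Steps 3 and 4 can be illustrated with a small sketch. It assumes that performance analysis yields, for each software function, its share of the total run time, and it applies a simple threshold rule to pick hardware candidates. The names `ProfiledFunction` and `select_for_hardware`, and the threshold rule itself, are hypothetical and serve only as an illustration; they are not the partitioning method used in this thesis.

```cpp
#include <string>
#include <vector>

// Hypothetical profile entry: one software function and the share of
// total run time it consumed during performance analysis (Step 3).
struct ProfiledFunction {
    std::string name;
    double time_share;  // fraction of total run time, 0.0 to 1.0
};

// Step 4 (illustrative rule only): every function whose share of the
// run time reaches the threshold becomes a candidate for hardware.
std::vector<std::string> select_for_hardware(
        const std::vector<ProfiledFunction>& profile, double threshold) {
    std::vector<std::string> candidates;
    for (const ProfiledFunction& f : profile) {
        if (f.time_share >= threshold) {
            candidates.push_back(f.name);
        }
    }
    return candidates;
}
```

A real partitioning decision would also weigh the hardware area and the communication cost of each candidate, which this sketch leaves out.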

Figure 1- Design flow of the general co-design approach (blocks: system definition, function design, hardware and software partitioning and co-design, hardware design, software design, hardware fabrication, software code, integration and test)

1.2 FFT Algorithm

Fast Fourier Transform (FFT) algorithms are widely used in many areas of science and

engineering. Some of the most widely known FFT algorithms are Radix-2 algorithm, Radix-4

algorithm, Split Radix algorithm, Fast Hartley transformation based algorithm and Quick Fourier

transform. The Discrete Fourier Transform (DFT) is used to produce frequency analysis of

discrete non-periodic signals. The FFT is another method of achieving the same result, but with

less overhead involved in the calculations.

One of the most widely used techniques in science and engineering is the Fourier transform and the algorithms based on it. In signal processing, it is primarily used to convert an input signal from the time domain into the frequency domain and vice versa. In digital systems, signals are sampled in the time domain, so the Discrete Fourier Transform (DFT) is used. The DFT is applied to a discrete input signal and yields the frequency characteristics of the signal as output. Performing the inverse DFT, which has a mathematical form very similar to the DFT, on the frequency-domain result gives back the signal in the time domain. Converting a signal into the frequency domain therefore exposes its frequency components, and unwanted components can then be removed. This concept is used in image and audio compression and in filters for communication signals, to name a few. The DFT is computationally intensive: it is based on summing a finite series of products of input signal values and trigonometric functions, and its time complexity is O(n^2). To increase performance, several algorithms were proposed that can be implemented in hardware or software. This set of algorithms is known as Fast Fourier Transforms (FFT). The first major FFT algorithm was proposed by Cooley and Tukey. Many FFT algorithms were proposed with a time complexity of O(n log n). Some of them

are Radix-2 algorithm, Radix-4 algorithm and Split Radix algorithm. In this paper, we discuss

ways of parallelizing these algorithms to reduce the communication overhead.
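To make the O(n^2) cost of the DFT concrete, the following is a minimal C++ sketch of the direct summation X[k] = sum over n of x[n] * exp(-j*2*pi*k*n/N). The function name `dft` and the use of double precision are illustrative choices and are not taken from the thesis implementation.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Direct evaluation of the DFT definition. The two nested loops over N
// are the source of the O(N^2) operation count.
std::vector<std::complex<double>> dft(const std::vector<std::complex<double>>& x) {
    const std::size_t N = x.size();
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> X(N);
    for (std::size_t k = 0; k < N; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            // Twiddle factor exp(-j*2*pi*k*n/N) for this (k, n) pair.
            const double angle =
                -2.0 * pi * static_cast<double>(k * n) / static_cast<double>(N);
            sum += x[n] * std::polar(1.0, angle);
        }
        X[k] = sum;
    }
    return X;
}
```

For N = 16 the inner statement runs 256 times, whereas a radix-2 FFT of the same length needs (N/2) * log2(N) = 32 butterflies.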

1.3 Motivation

The Fast Fourier Transform is a critical tool in digital signal processing, where its value in analyzing the spectral content of signals has found use in a wide variety of applications. The most prevalent of these is in the field of communications, where the ever-increasing demand on signal processing capabilities has raised the importance of the Fourier transform to the field.

However, the Fourier transform is a part of many systems in a wide variety of industrial and research fields. Its uses range from signal processing for the analysis of physical phenomena to analysis of data in mathematical and financial systems. The majority of systems requiring Fourier transforms are real-time systems, which necessitate high-speed processing of data. Given the complexity of computing the Discrete Fourier Transform, the implementation of a high-speed Fast Fourier Transform has required dedicated hardware processors. The majority of high-performance Fourier transforms have required full custom integrated circuits, typically in the form of an application-specific integrated circuit (ASIC). Although much work has been put into raising performance while reducing hardware requirements and cost, the cost of full custom hardware still limits the availability of Fourier transform hardware to low volume production. Nevertheless, the development of programmable logic hardware has produced devices that are increasingly capable of handling large-scale hardware. High-density field programmable gate arrays (FPGAs) already available on the market offer upwards of 180,000 logic elements, nine megabits of memory, and on-board processors. Implementing hardware on an FPGA eliminates the long and costly process of creating a full custom integrated circuit, along with the time and cost of its testing and verification. This saves cost in design and testing and shortens the time from design to a functional device. These features make the FPGA especially attractive for creating embedded processors for research and development purposes. However, the design of any embedded processor must consider two important factors, efficiency and flexibility, to reach an ideal design.

1.4 Problem Statement

Efficiency and flexibility are two of the most important driving factors in embedded system

design. Efficient implementations are required to meet the tight cost, timing, and power

constraints present in embedded systems. Flexibility, albeit tough to quantify, is equally

important; it allows system designs to be easily modified or enhanced in response to bugs,

evolution of standards, market shifts, or user requirements, during the design cycle and even


after production. Various implementation alternatives for a given function, ranging from custom-

designed hardware to software running on embedded processors, provide a system designer with

differing degrees of efficiency and flexibility. Unfortunately, it is often the case that these are

conflicting design goals. While efficiency is obtained through custom hardwired

implementations, flexibility is best provided through programmable implementations.

Hardware/software partitioning, which separates a system's functionality into embedded software (running on programmable processors) and custom hardware (implemented as coprocessors or peripheral units), is one approach to achieving a good balance between flexibility and efficiency.

1.5 Project Objectives

The aims of this project are as follows:

Design and implementation of the Fast Fourier Transform (FFT) algorithm in an embedded system:

a) Utilizing the Nios II embedded processor.

b) Implementation of the FFT algorithm using the NIOS II processor without custom instruction.

c) Implementation of the FFT algorithm using the NIOS II processor with custom instruction.

d) Comparison of the two FFT designs, with and without custom instruction, in terms of speed and area.
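The speed comparison in objective (d) amounts to timing the same FFT routine in both configurations and taking the ratio of the two run times. The following host-side sketch shows one way such a measurement could be structured. The helper `mean_runtime_us` is a hypothetical name, and `std::chrono` stands in for whatever timer is used on the target board; the sketch is not the measurement code of this thesis.

```cpp
#include <chrono>
#include <functional>

// Runs `work` `repetitions` times and returns the mean duration of one
// run in microseconds, measured with the host's monotonic clock.
double mean_runtime_us(const std::function<void()>& work, int repetitions) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> total = stop - start;
    return total.count() / repetitions;
}
```

With two measurements, one per configuration, the speedup is the run time without custom instruction divided by the run time with it. Averaging over many repetitions reduces the influence of timer resolution on a transform as short as 16 points.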

1.6 Scope of Work

Taking into account the resources and time available, this project is narrowed down to the

following scope of work.

1. This project only considers a 16-point floating-point FFT. The Decimation-In-Time (DIT) algorithm is chosen.

2. The algorithm is implemented in C++.

3. The Floating Point Custom Instruction is targeted for the Nios II platform and implemented on the ALTERA Cyclone II DE2 board.

4. The NIOS II IDE graphical user interface (GUI) is used to interface with the FPGA hardware to provide inputs and display outputs.

5. Universal Serial Bus (USB) is used for transmitting and receiving data between the FPGA board and the host computer.

6. This embedded system is applied to spectral analysis as an application.
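The first two scope items can be sketched as an iterative radix-2 decimation-in-time FFT on single-precision complex data. This is a generic textbook formulation under stated assumptions: the function name `fft_radix2_dit` is hypothetical, and the code is not the implementation developed in this thesis. It accepts any power-of-two length, of which 16 is the case in scope.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// In-place iterative radix-2 decimation-in-time FFT on float data.
// The length must be a power of two (16 in the scope of this project).
void fft_radix2_dit(std::vector<std::complex<float>>& a) {
    const std::size_t n = a.size();
    assert(n != 0 && (n & (n - 1)) == 0);

    // Bit-reversal permutation: DIT consumes the input in bit-reversed order.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) {
            j ^= bit;
        }
        j ^= bit;
        if (i < j) {
            std::swap(a[i], a[j]);
        }
    }

    const float pi = std::acos(-1.0f);
    // log2(n) stages; each stage merges pairs of half-length transforms.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const float step = -2.0f * pi / static_cast<float>(len);
        for (std::size_t start = 0; start < n; start += len) {
            for (std::size_t k = 0; k < len / 2; ++k) {
                const std::complex<float> w =
                    std::polar(1.0f, step * static_cast<float>(k));
                const std::complex<float> u = a[start + k];
                const std::complex<float> v = a[start + k + len / 2] * w;
                a[start + k] = u + v;            // butterfly, upper output
                a[start + k + len / 2] = u - v;  // butterfly, lower output
            }
        }
    }
}
```

For 16 points the outer loop runs four stages of eight butterflies each. The single-precision multiplications and additions inside the butterfly are the kind of operation that floating-point hardware support is intended to speed up.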

1.7 Project Contributions

The most important contributions of this project are:

1. An integration framework for the ALTERA development kit platform.

2. Use of the Nios II Floating Point Custom Instruction in the design to increase performance and accelerate speed.

3. A simple protocol for interaction and communication between hardware and software via the computer serial port.

1.8 Thesis Outline

The thesis is organized into six chapters, along with references and an appendix. The first chapter (this chapter) presents the background of the work, problem statement, research objectives, work scope and contributions of this project.

Chapter 1 presents a basic overview of the need for embedded systems, co-design, and the FFT algorithm. It also contains the motivation and proposed work for the project. The last section presents the outline of the thesis.

Chapter 2 presents a literature survey on the concept of the soft IP core and its use in embedded systems. It also surveys topics such as the Nios II core and the FFT algorithm.

Chapter 3 presents an overview of the methodology, flow chart, algorithm, and system design using the NIOS II processor.

Chapter 4 introduces the Fast Fourier Transform. A derivation of the FFT is given, concentrating on the radix-2 algorithm. The chapter also gives a complete system description: the embedded system is explained first, followed by the methodology, the Nios II Floating Point Custom Instruction, and finally the implementation of the FFT algorithm in C++.

Chapter 5 shows the system results and the Nios II results. All results are appraised and compared. A conclusion summarizes the work in this thesis.

Chapter 6 proposes future work, suggesting ways to improve and extend the current design.

Embedded systems are hardware and software components working together to perform a specific function. To design an embedded system, the designer should choose a processor core based on the requirements and performance constraints of the particular application. Each core has different performance characteristics and features that are suitable for specific applications. A survey of different cores is given in the following chapter.

2.1 IP core

In information technology, design reuse is the inclusion of previously designed components

in software and hardware. This term is more frequently used in hardware development.

Developers can reuse a component in both similar and completely different applications, for

example a component used as part of a central processing unit for a PC could be reused in a

handheld device or a set-top box. Thus an IP core is defined as a pre-defined, pre-verified complex functional block that is integrated into the logic of a particular design. In electronic design, an IP core is a reusable unit of logic, cell, or chip layout design that is the intellectual property of one party and is used in making an FPGA or ASIC for a product.

An IP (intellectual property) core is a block of logic or data that is used in making a field programmable gate array (FPGA) or application-specific integrated circuit (ASIC) for a product. As essential elements of design reuse, IP cores are part of the growing electronic design automation (EDA) industry trend towards repeated use of previously designed components. Ideally, an IP core should be entirely portable, that is, easy to insert into any vendor technology or design methodology. Universal Asynchronous Receiver/Transmitters (UARTs), central processing units (CPUs), Ethernet controllers, and PCI interfaces are all examples of IP cores. One of the most important product development decisions facing SOC designers today is choosing an intellectual property (IP) core. It can impact product performance and quality, as well as time-to-market and profitability. But SOC designers face many challenges when choosing a core. Determining which core is most appropriate for a given SOC requires careful consideration. Decisions must be made about the type of core (soft vs. hard), the quality of the deliverables, and the reliability and commitment of the IP provider. Continuing improvements in silicon manufacturing technology have made vast amounts of silicon real estate available to


today’s design engineers. Unfortunately, the ability of engineering teams to design circuits has

not kept pace. This imbalance has spawned the IP core industry. IP cores allow design teams to

rapidly create large system-on-a-chip designs (SOCs) by integrating pre-made blocks that do not

require any design work or verification. A number of difficult challenges accompany this new

design style. Depending on the core, they can be minimized or exacerbated. First of all, IP cores

may be delivered to customers in one of two forms: soft or hard. In both cases, the customer

receives a functionally verified design. A soft core, also known as a synthesizable core, is

synthesized by the customer and implemented in its SOC. A hard core, on the other hand, is fully

implemented and ready for manufacturing. (Technically, a design is not implemented until it is

manufactured. In this context, however, implemented means laid-out and ready for

manufacturing.) The SOC team need only drop the hard core into the chip as a single monolithic

piece. Soft and hard cores have different problems and benefits, which are addressed below. An

IP core jump-starts a key part of the SOC design task. The design team gets a verified design,

which enables them to complete their chip in less time with fewer engineering and EDA

resources. However, integrating a core into a chip requires many steps. How easily this is

accomplished, if at all, depends on the deliverables provided. This paper details some of the

collateral deliverables that enable easy core integration into all stages of the SOC design process.

Finally, there is the IP vendor to consider. The IP industry is still young and there have been a

number of poor products and even some failures, and they have not been confined to start-ups.

Consequently, a customer must evaluate not only the IP core, but also the IP provider.

2.1.1 Types of IP core

Cores can be classified in three categories: hard, firm and soft.

Soft cores

A synthesizable behavioral description of a complete microprocessor in a hardware description language (HDL) such as VHDL or Verilog is called a soft core. An HDL is analogous to a high-level language such as C in the field of computer programming. IP cores delivered to chip makers as RTL permit chip designers to modify designs at the functional level.

Hard cores

A hard core is generally defined as a lower-level, physical description provided in any of a

variety of physical layout file formats. These layouts must obey the target foundry's

process design rules, and hence, hard cores delivered for one foundry's process cannot be

easily ported to a different process or foundry. Such cores, whether analog or digital, are

called ’hard cores’, because the core’s application function cannot be meaningfully

modified by the customer.

Soft vs. Hard Cores

Let’s examine the pros and cons:

Because soft cores are not implemented, they are inherently more flexible in function and

implementation than hard cores. On the other hand, hard core developers can afford to

spend more time optimizing their implementations because they will be used in many

designs. Thus, there is a perception that hard cores offer higher performance.

In fact, high-end, full-custom hard cores designed for the most advanced processes do

offer more performance than soft cores. By using latches, dynamic logic, 3-state signals,

custom memories, and so on, the full-custom design team can achieve much better results

than a fully static synthesized design. For an SOC that requires performance that pushes

the limits of current process and design technology, a full-custom hard core is better able

to meet these needs. However, if the performance target is within the range of a soft core,

then the potential performance advantage of a hard core is immaterial. The SOC design

team can meet its performance goals with a soft core while taking advantage of its

inherent flexibility. (As process technology improves, the maximum frequency limits of

soft cores will also improve, making them an option for even more SOC designs.)

Even at slower clock frequencies, a hard core may offer an advantage in terms of silicon

area. But this is not always true. Often, a hard core is simply hardened using an ASIC-style methodology, which offers no advantage in area or speed. In other cases, a full-custom core is not re-optimized for each process generation, thus diminishing its

frequency and area advantages.

Technology Independence & Portability


One of the advantages of a soft core is that it is technology independent. That is to say,

the high level Verilog or VHDL does not require the use of a specific process technology

or standard cell library. This means that the same IP core can be used for multiple

designs, or for future generations of the current design. (Some soft core IP providers use

design styles that make their cores technology-dependent, but the advantages of this

approach are unclear.) A hard core, on the other hand, is very technology-specific. In fact,

if a foundry changes its process parameters or library factors, a hard core may not work

correctly with the process tweaks. This introduces risk since the IP provider will need to

re-verify the hard core when process parameters are changed. Hard cores can be ported to

new process technologies, but the effort to re-optimize full-custom cores is significant

and costly. It may take two years or more for some advanced microprocessor cores.

Because of this, hard cores are often optically scaled for new processes. While simple and

fast, this technique diminishes many of the advantages of the full-custom optimizations

done by the design team for the original process.

Furthermore, optical scaling introduces additional risk, since it only guarantees that the

new design meets design rules. It does not guarantee correct timing or function. Since the

optical scaling is a short-cut design style, it can be very difficult to fully re-verify an

optically scaled IP core. In reality, soft cores are likely designed with one technology and

library in mind. The design itself is independent of this choice of technology, but it is

optimized for this one technology and library. Similar technologies will be near-optimal,

but some significantly different technologies (for instance, ones with very slow RAMs)

may not see equivalent results. However, this effect is secondary. Soft cores will

generally be better optimized than optically scaled hard cores.

Speed/Area/Power Optimization

Hard cores are optimized once, when they are implemented by the IP provider. Because

the core is optimized only once, the IP provider can afford to spend significant resources.

Thus, a hard core will typically run faster than a comparable soft core for that one

technology in which it is implemented. But, even in that single technology, it is only

optimized for one set of goals. If the goal is low area at reasonable performance, the


highly tuned performance-optimized hard core may be too large for the application. Soft

cores, on the other hand, can be “application optimized”: Timing, area and power targets

can be adjusted to fit the specific embedded SOC design. For instance, if an SOC uses a

200-MHz clock, then a soft IP core that was designed to run at 250 MHz can be targeted

to run at exactly 200 MHz instead. This allows for smaller area and lower power while

still meeting the design constraints. This application optimization also extends down to

low-level IO timing. The IO constraints of a soft core can be adjusted to exactly fit the

environment the core will be used in. If a hard core has a late output signal, there is little

the SOC designer can do to improve that timing. If an SOC’s speed, area and power

targets are exactly what the hard core was targeted for, then that hard core will be

competitive. For the great majority of designs, however, a soft core will be better

optimized for that particular SOC.
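The 200 MHz versus 250 MHz example can be restated in terms of clock period, which is the quantity a synthesis timing constraint usually expresses. The short calculation below is added as an illustration; the symbols T_250 and T_200 are introduced here and do not appear in the source.

```latex
T_{250} = \frac{1}{250\ \text{MHz}} = 4\ \text{ns}, \qquad
T_{200} = \frac{1}{200\ \text{MHz}} = 5\ \text{ns}, \qquad
T_{200} - T_{250} = 1\ \text{ns}
```

Retargeting the core to 200 MHz therefore relaxes each cycle by 1 ns, or 25 percent of the original 4 ns budget. This slack is what allows synthesis to choose smaller, lower-power logic while still meeting timing.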

Soft cores offer another advantage over hard cores: compile-time customizations. These

are design options chosen prior to implementation. Cache memory size is a common

compile-time customization. A soft-core processor can be configured for exactly the

cache size needed by the specific embedded application. A hard core, on the other hand,

cannot be customized in this way.

Another customization employed in many soft cores is instruction specialization, or optional support for certain instructions. For example, support for external coprocessors may be

IP cores are released by the IP core provider as soft, firm, or hard cores, depending on the level of changes that the SoC designer (also called the IP core user) can make to them and the level of transparency they come with when delivered to the final SoC integrator. A soft core consists of a synthesizable HDL description that can be synthesized into different semiconductor processes and design libraries. A firm core contains more structural information, usually a gate-level netlist that is ready for placement and routing. A hard core includes layout and technology-dependent timing information and is ready to be dropped into a system, but no changes are allowed to it. Hard cores usually have a lower cost, as they are final plug-and-play designs implemented in a specific technology library. A tremendous variety of IP cores of all types and functionalities is available to SoC designers. Designers therefore have the great advantage of selecting from a rich pool of well-designed and carefully verified cores and integrating them, in a plug-and-play fashion, into the system they are designing.

Core category    Changes    Cost      Description

Soft core        Many       High      HDL

Firm core        Some       Medium    Netlist

Hard core        No         Low       Layout

Table 1- Evaluating the IP Provider

Evaluating the IP Provider

There are many companies that offer IP cores. Many are small, start-up design houses,

and some are large, well-established companies using IP cores as a new method of

delivering their designs to customers. Unfortunately, the size of a company is not an

indicator of IP core quality. The SOC designer should verify the commitment a company

has made to IP core products.

For example, an IP provider that is not completely committed to IP cores has offerings

that may only be previous designs repackaged as IP cores. A company that is serious

about building high quality cores will design them from scratch with reuse in mind. This

section details some hallmarks of designs made to be reusable. First of all, watch out for

soft cores that are the source code for a full-custom hard core. Since these designs were

not originally made to be synthesizable, they will be poor products when compared to

those designed to be synthesizable. When making a hard core, optimizations can be made

based on the known implementation style. However, in a soft core, the implementation is

not yet done, so these shortcuts should not be taken since they may result in non-

functional or sub-optimal implementations. Another thing to look for in a soft core is

registered interface signals. By registering IOs, internal logic can be timing-independent

from anything the SOC team hooks up. Furthermore, it enables easy timing predictability

and gives very good timing constraints to the SOC designer. All of these things make

SOC design easier. A soft core that was designed from the beginning to be reusable will

typically have more configuration choices and more flexibility in implementation. It will

also likely be delivered with multiple design environments in mind. A design made

without reuse in mind will be less flexible in function and implementation. An IP core

with poor deliverables can also be difficult to integrate into an SOC flow. Therefore, it is

very important to evaluate the IP core deliverables to make sure the correct EDA tools

are supported and all steps of the SOC flow can be addressed properly. The choice of the

IP provider is perhaps as important as the choice of the IP core itself. An IP provider that

is making a significant commitment to IP cores is an absolute necessity. Furthermore, the

SOC team needs to know that the IP provider will be there in the future to support the

product as well as to introduce the new products. There are many challenges facing

today’s SOC designer. Using a high-quality IP core from a reputable company should

make those challenges easier, not more difficult.

Firm Core

A firm core is a structural description of a component provided in HDL. Firm cores, sometimes called semi-hard cores, also carry placement data like hard cores, but are configurable to various applications. They provide some retargetability and some limited optimization.

2.1.2 Sources of IP cores

There are two sources of IP cores: commercial and open source. IP cores available under a commercial license are commercial cores. Some of the well-known commercial cores are Altera's Nios II, Xilinx's MicroBlaze and PicoBlaze, and Tensilica's Xtensa. The source code of open-source soft-core processors is freely available under the GNU (L)GPL license and can be downloaded from the internet. Examples are LEON by Gaisler Research and OpenRISC 1000 from opencores.org.

2.1.3 Comparison of soft IP cores

As part of an extensive library of cores, Altera developed Nios, a processor specifically designed for programmable logic and system-on-a-programmable-chip integration. The Nios processor is a pipelined, single-issue RISC processor in which most instructions run in a single clock cycle. There are two versions available, one with a native 16-bit word size and one with a native 32-bit word size. A development kit is available that includes a C/C++ compiler, debugger, and assembler, as well as other development utilities. It also supports operating systems such as µC/OS, Linux, and Nucleus [6].

Xtensa is Tensilica's best known processor IP architecture. Tensilica's Xtensa architecture

offers a key differentiating feature of a user-customizable instruction set. Using the

supplied customization tools, customers can extend the Xtensa base instruction-set by

adding new user-defined instructions. After the final processor configuration is made and

submitted, Tensilica's processor generator service builds the configured Xtensa IP core,

processor design kit, and software development kit. The software kit is built on the

Eclipse-based integrated development environment, and uses a GNU derived tool-chain.

An instruction set simulator enables customers to begin application development before

actual hardware is available.

In response to the development of soft-core processors, Xilinx introduced the MicroBlaze processor. The MicroBlaze processor is a 32-bit RISC processor that supports both 16-bit and 32-bit buses and supports block RAM and/or external memory.

All peripherals including the memory controller, the UART, and the interrupt controller

run off a standard OPB bus. Additional processor performance can be achieved by

exploiting Virtex-II architecture features such as the embedded multiplier and ALU.

Xilinx also provides GNU-based tools, including a C-compiler, a debugger, and an

assembler, as well as all of the standard libraries.

The fully synthesizable code of the OpenRISC 1000 processor is freely available and was designed with an emphasis on scalability and platform independence. The architecture consists of a 32-bit RISC integer unit with a configurable number of general-purpose registers, configurable cache and TLB sizes, dynamic power management support, and space for user-provided instructions. A complete GNU-based development environment is available and includes a C compiler, assembler, linker, debugger, and simulator.

The LEON3 processor core is a synthesizable VHDL model of a 32-bit processor

compliant with the SPARC V8 architecture. The core is highly configurable and


particularly suitable for system-on-a-chip designs. The core is interfaced using the

AMBA 2.0 AHB bus and supports the IP core plug & play method provided in the

Gaisler Research IP library. The processor can be efficiently implemented on FPGA [4].

3.1 CO-DESIGN METHODOLOGY

The activity of co-design is interchangeable with that of personalization when a service user and

a provider set about creating the desired service for an individual. If an overarching objective is

to make services more personalized, then a huge amount can be learnt about the design of the

activity of personalization by working with a small number of customers.

Through such work, service designers gain insights on how people see and communicate their

needs, how they perceive the role of the provider and the requirements of the support they need.

The practical tools for personalization can be designed with those who will need to use them.

Co-design is a very public and visible process. As uncomfortable as it can feel, transparency

through greater collaboration is key to both managing expectations early and getting honest and

accurate (and therefore useful) outputs. The scale of this openness needs to be managed carefully

as scrutiny by too many can mean that political assuagement or appeasement overrides the

careful crafting of a solution emergent through the process of designing.

Co-design has challenged many professional designers because the idea of allowing anybody to

have a go is seen as a threat to quality as well as a denial of skill and talent. One view is that

there is some truth in this, but it’s often the case that those expressing such a concern are basing

their view on a conventional understanding of what design does, and an unclear picture of an

emerging role for design and designers.

The lone designer can solve simple problems and give form to solutions, but complex challenges

demand collaborative platforms and projects. It’s also worth remembering that tangible and

elegant solutions still need to be designed and this is the unique contribution of trained designers.

A belief is that professional designers are valuable in new ways and not to the detriment of what

designers have always done well. However the activity of designing responses to complex

challenges is too important to leave only to designers.

3.2 Flow Chart

Figure 2- Work Flow

3.3 Algorithm

1. FFT Algorithms
2. Create the embedded system in the Altera Quartus II 10.1 System-on-a-Programmable-Chip (SOPC) Builder.
3. Implement the FFT algorithms in C++ in the NIOS II IDE.
4. Create the serial connection between the hardware and the software.
5. Upload the system without the custom instruction to the FPGA (EP2C35F672C6).
6. Upload the system with the custom instruction to the FPGA (EP2C35F672C6).
7. Compare the results of both systems.

3.4 System using NIOS II processor

In this project, the embedded system is generated around the NIOS II soft-core processor.
This chapter describes how the system is generated using the NIOS II processor.

3.4.1 Introduction

The NIOS II processor is a general-purpose RISC processor core with the following features:

Full 32-bit instruction set, data path, and address space

32 general-purpose registers

Optional shadow register sets

32 interrupt sources

External interrupt controller interface for more interrupt sources

Single-instruction 32 × 32 multiply and divide producing a 32-bit result

Dedicated instructions for computing 64-bit and 128-bit products of multiplication

Floating-point instructions for single-precision floating-point operations

Single-instruction barrel shifter.

Access to a variety of on-chip peripherals, and interfaces to off-chip memories and

peripherals.

Hardware-assisted debug module enabling processor start, stop, step, and trace under

control of the NIOS II software development tools.

Optional memory management unit (MMU) to support operating systems that

require MMUs.

Optional memory protection unit (MPU)

Software development environment based on the GNU C/C++ tool chain and the

NIOS II Software Build Tools (SBT) for Eclipse

Integration with Altera's SignalTap® II Embedded Logic Analyzer, enabling real-time
analysis of instructions and data along with other signals in the FPGA design.

Instruction set architecture (ISA) compatible across all NIOS II processor systems.

Performance up to 250 DMIPS

A Nios II processor system is equivalent to a microcontroller or “computer on a chip”

that includes a processor and a combination of peripherals and memory on a single chip.

A Nios II processor system consists of a Nios II processor core, a set of on-chip

peripherals, on-chip memory, and interfaces to off-chip memory, all implemented on a

single Altera device. Like a microcontroller family, all Nios II processor systems use a

consistent instruction set and programming model.

The Nios II processor is a configurable soft IP core, as opposed to a fixed, off-the-shelf

microcontroller. We can add or remove features on a system-by-system basis to meet

performance or price goals. Soft means the processor core is not fixed in silicon and can

be targeted to any Altera FPGA family. Altera provides ready-made Nios II system

designs that can be used as is. If these designs meet the system requirements, there is no

need to configure the design further. In addition, we can use the Nios II instruction set

simulator to begin writing and debugging Nios II applications before the final hardware

configuration is determined. Nios II is offered in 3 different configurations: Nios II/f

(fast), Nios II/s (standard), and Nios II/e (economy).

Nios II/f

The Nios II/f core is designed for maximum performance at the expense of core size.

Features of Nios II/f include:

Separate instruction and data caches (512 B to 64 KB)

Optional MMU or MPU

Access to up to 2 GB of external address space

Optional tightly coupled memory for instructions and data

Six-stage pipeline to achieve maximum DMIPS/MHz

Single-cycle hardware multiply and barrel shifter

Optional hardware divide

Dynamic branch prediction

Up to 256 custom instructions and unlimited hardware accelerators

JTAG debug module

Optional JTAG debug module enhancements, including hardware breakpoints, data

triggers, and real-time trace

Nios II/s

The Nios II/s core is designed to maintain a balance between performance and cost. Features

of Nios II/s include:

Instruction cache

Up to 2 GB of external address space

Optional tightly coupled memory for instructions

Five-stage pipeline

Static branch prediction

Hardware multiply, divide, and shift options

Up to 256 custom instructions

JTAG debug module

Optional JTAG debug module enhancements, including hardware breakpoints, data

triggers, and real-time trace

Nios II/e

The Nios II/e core is designed for smallest possible logic utilization of FPGAs. This is

especially efficient for low-cost Cyclone II FPGA applications. Features of Nios II/e

include:

Up to 2 GB of external address space

JTAG debug module

Complete systems in fewer than 700 LEs

Optional debug enhancements

Up to 256 custom instructions

3.4.2 NIOS II Architecture

The Nios II architecture is a RISC soft core architecture which is implemented entirely in

the programmable logic and memory blocks of Altera FPGAs. The soft-core nature of the

Nios II processor lets the system designer specify and generate a custom Nios II core,

tailored for his or her specific application requirements. System designers can extend the

Nios II's basic functionality by adding a predefined memory management unit, or

defining custom instructions and custom peripherals.

The NIOS II architecture describes an instruction set architecture (ISA). The ISA in turn

necessitates a set of functional units that implement the instructions. A NIOS II processor

core is a hardware design that implements the Nios II instruction set and supports the

functional units described in this document. The processor core does not include

peripherals or the connection logic to the outside world. It includes only the circuits

required to implement the NIOS II architecture.

Figure 3- NIOS II Core Block Diagram

The NIOS II architecture defines the following functional units:

Register file

Arithmetic logic unit (ALU)

Interface to custom instruction logic

Exception controller

Internal or external interrupt controller

Instruction bus

Data bus

Memory management unit (MMU)

Memory protection unit (MPU)

Instruction and data cache memories

Tightly-coupled memory interfaces for instructions and data

JTAG debug module

3.4.3 Processor Implementation

The functional units of the Nios II architecture form the foundation for the Nios II

instruction set. However, this does not indicate that any unit is implemented in hardware.

The Nios II architecture describes an instruction set, not a particular hardware

implementation. A functional unit can be implemented in hardware, emulated in

software, or omitted entirely. A Nios II implementation is a set of design choices

embodied by a particular Nios II processor core. Each implementation achieves specific

objectives, such as smaller core size or higher performance. This flexibility allows the

Nios II architecture to adapt to different target applications.

Implementation variables generally fit one of three trade-off patterns: more or less of a

feature; inclusion or exclusion of a feature; hardware implementation or software

emulation of a feature. An example of each trade-off follows:

More or less of a feature—for example, to fine-tune performance, you can increase

or decrease the amount of instruction cache memory. A larger cache increases

execution speed of large programs, while a smaller cache conserves on-chip memory

resources.

Inclusion or exclusion of a feature—For example, to reduce cost, you can choose to

omit the JTAG debug module. This decision conserves on-chip logic and memory

resources, but it eliminates the ability to use a software debugger to debug

applications.

Hardware implementation or software emulation—For example, in control

applications that rarely perform complex arithmetic, you can choose for the division

instruction to be emulated in software. Removing the divide hardware conserves on-

chip resources but increases the execution time of division operations.

3.4.4 Register File

The NIOS II architecture supports a flat register file, consisting of thirty-two 32-bit

general-purpose integer registers, and up to thirty-two 32-bit control registers. The

architecture supports supervisor and user modes that allow system code to protect the

control registers from errant applications. The NIOS II processor can optionally have one

or more shadow register sets. A shadow register set is a complete set of NIOS II general-

purpose registers. When shadow register sets are implemented, the CRS field of the status

register indicates which register set is currently in use. An instruction access to a general-

purpose register uses whichever register set is active. Typical use of shadow register sets

is to accelerate context switching. When shadow register sets are implemented, the NIOS

II processor has two special instructions, rdprs and wrprs, for moving data between

register sets. Shadow register sets are typically manipulated by an operating system

kernel, and are transparent to application code. A Nios II processor can have up to 63

shadow register sets.

3.4.5 Arithmetic Logic Unit

The Nios II ALU operates on data stored in general-purpose registers. ALU operations
take one or two inputs from registers and store a result back in a register, as shown in
Table 2.

In the hardware implementation, a custom instruction maps software operations such as
addition, subtraction, multiplication and division directly onto hardware attached to the ALU
of the NIOS II processor. Including such hardware reduces the number of clock cycles, and
therefore the time, required to execute the algorithm.

Category          Details

Arithmetic        The ALU supports addition, subtraction, multiplication and division on
                  signed and unsigned operands.

Relational        The ALU supports the equal, not-equal, greater-than-or-equal, and
                  less-than relational operators (==, !=, >=, <) on signed and unsigned
                  operands.

Logical           The ALU supports AND, OR, NOR and XOR logical operations.

Shift and Rotate  The ALU supports shift and rotate operations, and can shift/rotate data
                  by 0 to 31 bit positions per instruction. The ALU supports arithmetic
                  shift right and logical shift right/left. The ALU supports rotate
                  left/right.

Table 2- Operations Supported by the Nios II ALU
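The shift and rotate rows of the table can be made concrete with a small host-side sketch. This is only an illustration in portable C++ of what the operations compute on 32-bit values; the function names are hypothetical and the code does not model the Nios II instructions themselves.

```cpp
#include <cstdint>

// Rotate a 32-bit value left by n bit positions (n is taken modulo 32).
std::uint32_t rotate_left(std::uint32_t value, unsigned n) {
    n &= 31u;
    return n == 0 ? value : (value << n) | (value >> (32u - n));
}

// Logical shift right: vacated high bits are filled with zeros.
std::uint32_t logical_shift_right(std::uint32_t value, unsigned n) {
    return value >> (n & 31u);
}

// Arithmetic shift right: vacated high bits are filled with the sign bit.
// Written with explicit masking so the result does not depend on how the
// compiler treats >> on negative signed values.
std::int32_t arithmetic_shift_right(std::int32_t value, unsigned n) {
    n &= 31u;
    std::uint32_t shifted = static_cast<std::uint32_t>(value) >> n;
    if (value < 0 && n > 0) {
        shifted |= ~(0xFFFFFFFFu >> n);  // replicate the sign bit
    }
    return static_cast<std::int32_t>(shifted);
}
```

For example, rotating 0x80000001 left by one position wraps the top bit around to give 0x00000003, while an arithmetic shift of -8 by one position gives -4.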

3.4.6 Exception and Interrupt Controllers

The NIOS II processor includes hardware for handling exceptions, including hardware

interrupts. It also includes an optional external interrupt controller (EIC) interface. The

EIC interface enables you to speed up interrupt handling in a complex system by adding a

custom interrupt controller.

3.5 Memory and I/O Organization

This section explains hardware implementation details of the NIOS II memory and I/O

organization. The discussion covers both general concepts true of all NIOS II processor systems,

as well as features that might change from system to system. The flexible nature of the NIOS II

memory and I/O organization is the most notable difference between NIOS II processor

systems and traditional microcontrollers. Because Nios II processor systems are configurable, the

memories and peripherals vary from system to system. As a result, the memory and I/O

organization varies from system to system.

A Nios II core uses one or more of the following to provide memory and I/O access:

Instruction master port—An Avalon® Memory-Mapped (Avalon-MM) master port that

connects to instruction memory via system interconnect fabric

Instruction cache—Fast cache memory internal to the Nios II core

Data master port—An Avalon-MM master port that connects to data memory and

peripherals via system interconnect fabric

Data cache—Fast cache memory internal to the Nios II core

Tightly-coupled instruction or data memory port—Interface to fast on-chip memory

outside the Nios II core.

3.5.1 Instruction and Data Buses

The NIOS II architecture supports separate instruction and data buses, classifying it as a

Harvard architecture. Both the instruction and data buses are implemented as Avalon-

MM master ports that adhere to the Avalon-MM interface specification. The data master

port connects to both memory and peripheral components, while the instruction master

port connects only to memory components.

3.5.2 Memory and Peripheral Access

The NIOS II architecture provides memory-mapped I/O access. Both data memory and

peripherals are mapped into the address space of the data master port. The Nios II

architecture uses little-endian byte ordering. Words and half words are stored in memory

with the more-significant bytes at higher addresses. The Nios II architecture does not

specify anything about the existence of memory and peripherals; the quantity, type, and

connection of memory and peripherals are system-dependent. Typically, Nios II

processor systems contain a mix of fast on-chip memory and slower off-chip memory.

Peripherals typically reside on-chip, although interfaces to off-chip peripherals also exist.
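The little-endian byte ordering described above can be sketched as follows. This is a host-side illustration only, with hypothetical helper names: a four-element array stands in for four consecutive byte addresses, index 0 being the lowest address.

```cpp
#include <array>
#include <cstdint>

// Store a 32-bit word into four bytes in little-endian order:
// the least-significant byte goes to the lowest address (index 0),
// the most-significant byte to the highest address (index 3).
std::array<std::uint8_t, 4> store_word_le(std::uint32_t word) {
    std::array<std::uint8_t, 4> mem{};
    for (int i = 0; i < 4; ++i) {
        mem[i] = static_cast<std::uint8_t>((word >> (8 * i)) & 0xFFu);
    }
    return mem;
}

// Reassemble the word from its little-endian byte layout.
std::uint32_t load_word_le(const std::array<std::uint8_t, 4>& mem) {
    std::uint32_t word = 0;
    for (int i = 0; i < 4; ++i) {
        word |= static_cast<std::uint32_t>(mem[i]) << (8 * i);
    }
    return word;
}
```

Storing 0x12345678 this way places 0x78 at the lowest address and 0x12 at the highest, which matches the rule that more-significant bytes sit at higher addresses.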

3.5.3 Instruction Master Port

The NIOS II instruction bus is implemented as a 32-bit Avalon-MM master port. The

instruction master port performs a single function: it fetches instructions to be executed by

the processor. The instruction master port does not perform any write operations. The

instruction master port is a pipelined Avalon-MM master port. Support for pipelined

Avalon-MM transfers minimizes the impact of synchronous memory with pipeline latency

and increases the overall fMAX of the system. The Nios II processor can prefetch sequential

instructions and perform branch prediction to keep the instruction pipe as active as possible.

The instruction master port always retrieves 32 bits of data. The instruction master port

relies on dynamic bus-sizing logic contained in the system interconnect fabric. By virtue of

dynamic bus sizing, every instruction fetch returns a full instruction word, regardless of the

width of the target memory.

3.5.4 Data Master Port

The NIOS II data bus is implemented as a 32-bit Avalon-MM master port. The data master

port performs two functions:

Read data from memory or a peripheral when the processor executes a load Instruction.

Write data to memory or a peripheral when the processor executes a store Instruction.

3.5.5 Cache Memory

The NIOS II architecture supports cache memories on both the instruction master port

(instruction cache) and the data master port (data cache). Cache memory resides on-chip as

an integral part of the Nios II processor core. The cache memories can improve the average

memory access time for Nios II processor systems that use slow off-chip memory such as

SDRAM for program and data storage. The instruction and data caches are enabled

perpetually at run-time, but methods are provided for software to bypass the data cache so

that peripheral accesses do not return cached data. Cache management and cache coherency

are handled by software. The Nios II instruction set provides instructions for cache

management.

3.5.6 Tightly-Coupled Memory

Tightly-coupled memory provides guaranteed low-latency memory access for

performance-critical applications. Physically, a tightly-coupled memory port is a separate

master port on the NIOS II processor core, similar to the instruction or data master port.

Compared to cache memory, tightly-coupled memory provides the following benefits:

Performance similar to cache memory.

Software can guarantee that performance-critical code or data is located in

Tightly-coupled memory.

No real-time caching overhead, such as loading, invalidating, or flushing memory.

3.5.7 Address Map

The address map for memories and peripherals in a Nios II processor system is design
dependent. The address map is specified in Qsys or SOPC Builder. There are three

addresses that are part of the processor and deserve special mention:

Reset address

Exception address

Break handler address

Programmers access memories and peripherals by using macros and drivers. Therefore, the

flexible address map does not affect application developers.

3.5.8 Memory Management Unit

The optional NIOS II MMU provides the following features and functionality:

Virtual to physical address mapping

Memory protection

32-bit virtual and physical addresses, mapping a 4-GB virtual address space into

as much as 4 GB of physical memory

4-KB page and frame size

Low 512 MB of physical address space available for direct access

Hardware translation lookaside buffers (TLBs), accelerating address translation

Separate TLBs for instruction and data accesses

Read, write, and execute permissions controlled per page

Default caching behavior controlled per page

TLBs acting as n-way set-associative caches for software page tables

TLB sizes and associativities configurable in the Nios II Processor parameter editor

Format of page tables (or equivalent data structures) determined by system software

Replacement policy for TLB entries determined by system software

Write policy for TLB entries determined by system software
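With 32-bit addresses and a 4-KB page size, an address divides into a 20-bit page number and a 12-bit offset within the page. The sketch below illustrates only that arithmetic; the names are hypothetical, and the TLB lookup that maps a virtual page number to a physical frame number is not modeled.

```cpp
#include <cstdint>

constexpr std::uint32_t kPageShift = 12;                // 4 KB = 2^12 bytes
constexpr std::uint32_t kPageSize = 1u << kPageShift;   // 4096
constexpr std::uint32_t kOffsetMask = kPageSize - 1;    // low 12 bits

struct SplitAddress {
    std::uint32_t page_number;  // upper 20 bits of the address
    std::uint32_t offset;       // lower 12 bits of the address
};

// Split a 32-bit virtual address into page number and page offset.
SplitAddress split_virtual_address(std::uint32_t vaddr) {
    return {vaddr >> kPageShift, vaddr & kOffsetMask};
}

// Combine a physical frame number (as a translation would supply it)
// with the unchanged page offset to form the physical address.
std::uint32_t make_physical_address(std::uint32_t frame_number,
                                    std::uint32_t offset) {
    return (frame_number << kPageShift) | (offset & kOffsetMask);
}
```

For the virtual address 0x12345678 the page number is 0x12345 and the offset is 0x678; only the page number is translated, and the offset carries over unchanged.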

3.5.9 Memory Protection Unit

The optional NIOS II MPU provides the following features and functionality:

Memory protection

Up to 32 instruction regions and 32 data regions

Variable instruction and data region sizes

Read and write access permissions for data regions

Execute access permissions for instruction regions

Overlapping regions

3.6 JTAG Debug Module

The Nios II architecture supports a JTAG debug module that provides on-chip emulation features

to control the processor remotely from a host PC. PC-based software debugging tools

communicate with the JTAG debug module and provide facilities, such as the following features:

Downloading programs to memory

Starting and stopping execution

Analyzing registers and memory

Collecting real-time execution trace data

3.7 Embedded system generation using NIOS II processor

The Nios II development flow consists of three types of development: hardware design steps,
software design steps, and system design steps. System design steps involve both the
hardware and the software, and might require input from both sides.

Figure 4- Flow chart of System generation using Nios II processor

3.8 Defining and Generating the System in SOPC Builder

After analyzing the system hardware requirements, SOPC Builder is used to specify the Nios II
processor core(s), memory, and other components the system requires. SOPC Builder
automatically generates the interconnect logic to integrate the components in the hardware
system. The processor cores and components can be selected from the standard list provided with

the Nios II EDS. Custom hardware can also be added to accelerate system performance: custom
instruction logic added to the Nios II core accelerates CPU performance, while a custom
component offloads tasks from the CPU. The following step covers adding the standard
processor and component cores.

The primary outputs of SOPC Builder are the following file types:

SOPC Builder Design File (.sopc): contains the hardware contents of the SOPC Builder system.

SOPC Information File (.sopcinfo): contains a human-readable description of the contents of
the .sopc file. The Nios II EDS uses the .sopcinfo file to compile software for the target
hardware.

Hardware description language (HDL) files: the hardware design files that describe the SOPC
Builder system. The Quartus II software uses the HDL files to compile the overall FPGA design
into an SRAM Object File (.sof).

3.8.1 Integrating the SOPC Builder System into the Quartus II Project

After the Nios II system is generated using SOPC Builder, it is integrated into the Quartus
II project. The Quartus II software is used to perform all the tasks required to create the
final CYCLONE II FPGA hardware design: assigning pin locations for I/O signals, specifying
timing requirements, and applying other design constraints. Finally, the Quartus II project
is compiled to produce a .sof that configures the CYCLONE II FPGA. The .sof is downloaded to
the CYCLONE II FPGA on the target board (EP2C35F672C6) using an Altera download cable, such
as the USB-Blaster™. After

configuration, the FPGA behaves as specified by the hardware design, which in this case

is a Nios II processor system. The Nios II processor and the interfaces needed to connect

to other chips on the DE2 board are implemented in the Cyclone II FPGA chip. These

components are interconnected by means of the interconnection network called the

Avalon Switch Fabric. Memory blocks in the Cyclone II device can be used to provide an

on-chip memory for the Nios II processor. They can be connected to the processor either

directly or through the Avalon network. The SRAM and SDRAM memory chips on the

DE2 board are accessed through the appropriate interfaces.

Input/output interfaces are instantiated to provide connection to the I/O devices used in

the system. A special JTAG UART interface is used to connect to the circuitry that

provides a Universal Serial Bus (USB) link to the host computer to which the DE2 board

is connected. This circuitry and the associated software is called the USB-Blaster.

Another module, called the JTAG Debug module, is provided to allow the host computer

to control the Nios II processor. It makes it possible to perform operations such as

downloading programs into memory, starting and stopping execution, setting program

breakpoints, and collecting real-time execution trace data. Since all parts of the Nios II

system implemented on the FPGA chip are defined by using a hardware description

language, a knowledgeable user could write such code to implement any part of the

system.

4.1 Fast Fourier Transform

In this chapter, several methods for computing the Discrete Fourier Transform (DFT) efficiently

are presented. In view of the importance of the DFT in various digital signal processing

applications, such as linear filtering, correlation analysis, and spectrum analysis, its efficient

computation is a topic that has received considerable attention by many mathematicians,

engineers, and applied scientists. Basically, the computational problem for the DFT is to
compute the sequence {X(k)} of N complex-valued numbers given another sequence of data {x(n)}
of length N, according to the following formula:

\[ X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \qquad 0 \le k \le N-1, \qquad W_N = e^{-j 2\pi / N} \]

In general, the data sequence x(n) is also assumed to be complex valued. Similarly, the Inverse
Discrete Fourier Transform (IDFT) becomes:

\[ x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}, \qquad 0 \le n \le N-1 \]

We can observe that for each value of k, direct computation of X(k) involves N complex
multiplications (4N real multiplications) and N-1 complex additions (4N-2 real additions).
Consequently, to compute all N values of the DFT requires N^2 complex multiplications and
N^2-N complex additions. Direct computation of the DFT is basically inefficient primarily
because it does not exploit the symmetry and periodicity properties of the phase factor W_N.
In particular, these two properties are:

Symmetry property: \[ W_N^{k+N/2} = -W_N^{k} \]

Periodicity property: \[ W_N^{k+N} = W_N^{k} \]

The computationally efficient algorithms described in this section, known collectively as fast

Fourier transform (FFT) algorithms, exploit these two basic properties of the phase factor.
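The cost of direct computation can be seen in a minimal sketch of the DFT sum. This is an illustrative host-side C++ version, not the thesis implementation; the names `cplx` and `dft_direct` are assumptions. The two nested loops of length N are what give N^2 complex multiplications.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Direct evaluation of X(k) = sum_{n=0}^{N-1} x(n) * W_N^{kn},
// with W_N = exp(-j*2*pi/N). Every output sample needs N complex
// multiplications, so the whole transform needs N*N of them.
std::vector<cplx> dft_direct(const std::vector<cplx>& x) {
    const std::size_t N = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cplx> X(N);
    for (std::size_t k = 0; k < N; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            const double angle =
                -2.0 * pi * static_cast<double>(k * n) / static_cast<double>(N);
            sum += x[n] * std::polar(1.0, angle);  // x(n) * W_N^{kn}
        }
        X[k] = sum;
    }
    return X;
}
```

A unit impulse as input produces a flat spectrum of ones, and a constant input concentrates all of its energy in X(0), which makes both convenient sanity checks.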

4.2 FFT Algorithms

The FFT exists in two functionally equivalent forms known as decimation in time (DIT) and
decimation in frequency (DIF). Both decompose the DFT into simple computational units,
reducing the computational complexity of the DFT from O(N^2) to O(N log N). The various
algorithms that result from the FFT are collectively known as Radix-R Fast Fourier Transforms.

The most popular choices of the radix r are r = 2 and r = 4, and a commonly used
advancement upon the FFT is the use of a mixed radix.
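As a rough worked illustration of this reduction, the counts can be compared for one transform size. The value N = 1024 is chosen here purely as an example, and the radix-2 figure uses the multiplication count (N/2) log2 N that is derived in Section 4.3.1.

```latex
\begin{align*}
\text{Direct DFT:}\quad  & N^2 = 1024^2 = 1\,048\,576 \ \text{complex multiplications} \\
\text{Radix-2 FFT:}\quad & \tfrac{N}{2}\log_2 N = 512 \times 10 = 5\,120 \ \text{complex multiplications} \\
\text{Ratio:}\quad       & \frac{1\,048\,576}{5\,120} = 204.8
\end{align*}
```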

4.3 Radix-2 FFT Algorithms

The Radix-2 algorithm takes the DFT and applies a common-factor reduction, equating the
N-point sequence of the original DFT to the sum of two N/2-point sequences. This results in
the Radix-2 FFT formula below:

\[ X(k) = \sum_{m=0}^{N/2-1} x(2m)\, W_{N/2}^{mk} + W_N^{k} \sum_{m=0}^{N/2-1} x(2m+1)\, W_{N/2}^{mk} \]

This results in processing that follows the signal flow graph shown in Figure 5.

Figure 5- Radix-2 for an N point FFT

There are two forms of the radix-2 algorithm:

Decimation-in-time FFT algorithm (DIT)

Decimation-in-frequency FFT algorithm (DIF)

4.3.1 Decimation-In-Time FFT

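Now let us consider the computation of an N = 2^v point DFT by the divide-and-conquer
approach. We split the N-point data sequence into two N/2-point data sequences f1(n) and
f2(n), corresponding to the even-numbered and odd-numbered samples of x(n), respectively.
Thus f1(n) and f2(n) are obtained by decimating x(n) by a factor of 2, and hence the
resulting FFT algorithm is called the decimation-in-time FFT algorithm.

The equation can be expressed as:

\[ f_1(n) = x(2n), \qquad f_2(n) = x(2n+1), \qquad n = 0, 1, \ldots, N/2 - 1 \]

\[ X(k) = F_1(k) + W_N^{k} F_2(k), \qquad X(k + N/2) = F_1(k) - W_N^{k} F_2(k), \qquad k = 0, 1, \ldots, N/2 - 1 \]

where F1(k) and F2(k) are the N/2-point DFTs of f1(n) and f2(n), respectively.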

We observe that the direct computation of F1(k) requires (N/2)^2 complex multiplications.
The same applies to the computation of F2(k). Furthermore, N/2 additional complex
multiplications are required to compute W_N^k F2(k). Hence the computation of X(k) requires
2(N/2)^2 + N/2 = N^2/2 + N/2 complex multiplications. This first step results in a reduction
of the number of multiplications from N^2 to N^2/2 + N/2, which is about a factor of 2 for
large N. By the same argument, the N/2-point DFTs F1(k) and F2(k) can each be obtained from
two N/4-point DFTs.

The decimation of the data sequence can be repeated again and again until the resulting
sequences are reduced to one-point sequences. For N = 2^v this decimation can be performed
v = log2 N times. Thus the total number of complex multiplications is reduced to
(N/2) log2 N. The number of complex additions is N log2 N. For illustration, the figure
depicts the computation of an N = 16 point DFT. We observe that the computation is performed
in four stages, beginning with the computation of eight 2-point DFTs, then four 4-point
DFTs, then two 8-point DFTs, and finally one 16-point DFT.

Figure 6- Butterfly structure
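The decimation-in-time procedure described in this section can be sketched recursively. The version below is a minimal illustration that assumes the input length is a power of two; it is not the thesis code, and the names `cplx` and `fft_dit` are hypothetical. A practical embedded implementation would more likely be iterative and in-place.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Recursive radix-2 decimation-in-time FFT.
// Assumption: x.size() is a power of two (N = 2^v).
std::vector<cplx> fft_dit(const std::vector<cplx>& x) {
    const std::size_t N = x.size();
    if (N <= 1) return x;  // a one-point DFT is the sample itself

    // Decimate in time: even-numbered samples and odd-numbered samples.
    std::vector<cplx> f1(N / 2), f2(N / 2);
    for (std::size_t m = 0; m < N / 2; ++m) {
        f1[m] = x[2 * m];
        f2[m] = x[2 * m + 1];
    }
    const std::vector<cplx> F1 = fft_dit(f1);
    const std::vector<cplx> F2 = fft_dit(f2);

    // Butterflies: one complex multiplication by W_N^k per pair of outputs.
    const double pi = std::acos(-1.0);
    std::vector<cplx> X(N);
    for (std::size_t k = 0; k < N / 2; ++k) {
        const cplx t = std::polar(1.0, -2.0 * pi * static_cast<double>(k) /
                                           static_cast<double>(N)) * F2[k];
        X[k] = F1[k] + t;
        X[k + N / 2] = F1[k] - t;
    }
    return X;
}
```

For the four-point input {0, 1, 2, 3} the sketch returns {6, -2+2j, -2, -2-2j}, which agrees with evaluating the DFT sum directly.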

4.3.2 Decimation in frequency

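Another important radix-2 FFT algorithm, called the decimation-in-frequency algorithm, is
obtained by using the divide-and-conquer approach. To derive the algorithm, we begin by
splitting the DFT formula into two summations, one of which involves the sum over the first
N/2 data points and the other the sum over the last N/2 data points. Thus we obtain

\[ X(k) = \sum_{n=0}^{N/2-1} x(n)\, W_N^{kn} + \sum_{n=N/2}^{N-1} x(n)\, W_N^{kn} = \sum_{n=0}^{N/2-1} \left[ x(n) + (-1)^{k}\, x(n + N/2) \right] W_N^{kn} \]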

Now, let us split (decimate) X(k) into the even-numbered and odd-numbered samples.
Thus we obtain

\[ X(2k) = \sum_{n=0}^{N/2-1} \left[ x(n) + x(n + N/2) \right] W_{N/2}^{kn}, \qquad k = 0, 1, \ldots, N/2 - 1 \]

\[ X(2k+1) = \sum_{n=0}^{N/2-1} \left\{ \left[ x(n) - x(n + N/2) \right] W_N^{n} \right\} W_{N/2}^{kn}, \qquad k = 0, 1, \ldots, N/2 - 1 \]

The computation procedure above can be repeated through decimation of the N/2-point DFTs
X(2k) and X(2k+1). The entire process involves v = log2 N stages of decimation, where each
stage involves N/2 butterflies. Thus the total number of complex multiplications is reduced
to (N/2) log2 N. The number of complex additions is N log2 N. For illustration, the figure
depicts the computation of an N = 16 point DFT. We observe that the computation is performed
in four stages, which reduce the 16-point DFT to two 8-point DFTs, then four 4-point DFTs,
and finally eight 2-point DFTs.

Figure 7- Decimation in frequency
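The decimation-in-frequency procedure can be sketched in the same recursive style. Again this is a minimal illustration that assumes a power-of-two length, with the hypothetical names `cplx` and `fft_dif`. The input is first folded into sum and difference sequences, the difference is multiplied by the twiddle factor, and the two half-length transforms supply the even-numbered and odd-numbered output samples.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Recursive radix-2 decimation-in-frequency FFT.
// Assumption: x.size() is a power of two (N = 2^v).
std::vector<cplx> fft_dif(const std::vector<cplx>& x) {
    const std::size_t N = x.size();
    if (N <= 1) return x;

    const std::size_t half = N / 2;
    const double pi = std::acos(-1.0);

    // Butterflies on the input side:
    //   g1(n) = x(n) + x(n + N/2)
    //   g2(n) = [x(n) - x(n + N/2)] * W_N^n
    std::vector<cplx> g1(half), g2(half);
    for (std::size_t n = 0; n < half; ++n) {
        const cplx w = std::polar(1.0, -2.0 * pi * static_cast<double>(n) /
                                           static_cast<double>(N));
        g1[n] = x[n] + x[n + half];
        g2[n] = (x[n] - x[n + half]) * w;
    }

    // The N/2-point DFTs of g1 and g2 are X(2k) and X(2k+1).
    const std::vector<cplx> G1 = fft_dif(g1);
    const std::vector<cplx> G2 = fft_dif(g2);

    std::vector<cplx> X(N);
    for (std::size_t k = 0; k < half; ++k) {
        X[2 * k] = G1[k];
        X[2 * k + 1] = G2[k];
    }
    return X;
}
```

Because the interleaving is done at every level of the recursion, this sketch returns its output in natural order; an in-place DIF implementation would instead produce bit-reversed output that needs a reordering pass.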

4.4 Algorithms Implementation

In this project, we implemented both methods, Decimation-In-Time (DIT) and
Decimation-In-Frequency (DIF), to determine which one is better in efficiency, speed,
performance and delay. The following researchers are known to have applied the same methods:

Weidong Li, Jonas Carlsson, Jonas Claeson, and Lars Wanhammar (Electronics Systems,
Department of Electrical Engineering, Linköping University) employed the Fast Fourier
Transform algorithm in a Globally Asynchronous Locally Synchronous (GALS) design based on the
decimation-in-frequency radix-2 algorithm. Their simulation showed that DIF has high
performance and efficiency. Mohd Nazrin (UTM 2004) applied the Fast Fourier Transform
algorithm in FPGA technology. The design was based on the decimation-in-time radix-2
algorithm. Based on his simulation and results, he concluded that “DIT has many advantages
such as high efficiency, speed, performance and low delay.” Both methods give the same
results, but we are looking for performance, speed, hardware cost and efficiency. Through
this thesis we will determine the advantages and disadvantages of both the DIT and the DIF.

4.4.1 FPGA Implementation of FFT Algorithm

In the hardware implementation, a custom instruction is included in the Nios II processor.
When designing a system that includes a Nios II embedded processor, we can accelerate
time-critical software algorithms by adding custom instructions to the Nios II processor
instruction set. Custom instructions allow a complex sequence of standard instructions to be
reduced to a single instruction implemented in hardware. This feature can be used for a
variety of applications, for example, to optimize software inner loops for digital signal
processing (DSP), packet header processing, and computation-intensive applications. In SOPC
Builder, the Nios II parameter editor provides a GUI to add custom instructions to the Nios
II processor.

Figure 8- Custom Instruction Logic connects to ALU in SOPC builder system

In SOPC Builder, the custom instruction logic connects directly to the Nios II arithmetic
logic unit (ALU).
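From software, a custom instruction is typically reached through a compiler built-in. The sketch below assumes a floating-point multiply custom instruction and calls it through the GCC Nios II built-in `__builtin_custom_fnff`, falling back to an ordinary multiply when compiled for any other target. The opcode index `FFT_CI_FMUL_N` and the function names are hypothetical; the real index is assigned when the system is generated, and the operation actually placed in hardware in this project may differ.

```cpp
// Hypothetical opcode index of the custom instruction (assumption, not
// taken from the generated system).
#define FFT_CI_FMUL_N 0

// Multiply two floats, using the custom instruction when building for
// Nios II and plain software multiplication elsewhere.
inline float ci_fmul(float a, float b) {
#if defined(__NIOS2__) || defined(__nios2__)
    return __builtin_custom_fnff(FFT_CI_FMUL_N, a, b);
#else
    return a * b;  // host fallback so the sketch can be compiled anywhere
#endif
}

// Complex multiplication as used in an FFT butterfly:
// (ar + j*ai) * (br + j*bi), built from four real multiplications.
void complex_multiply(float ar, float ai, float br, float bi,
                      float& cr, float& ci) {
    cr = ci_fmul(ar, br) - ci_fmul(ai, bi);
    ci = ci_fmul(ar, bi) + ci_fmul(ai, br);
}
```

Keeping the built-in behind one small wrapper means the same FFT source can be built with and without the custom instruction, which is convenient when the two versions are to be compared.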

4.4.3 Design steps for the hardware implementation

After the software implementation, the same procedure is followed with some changes to the
system.

1. For the hardware implementation, the system is generated in SOPC Builder with the custom
instruction included as hardware.

2. The algorithm is implemented on the FPGA using the NIOS II IDE.

3. The program calculates the processing time and throughput for each of the versions, to
demonstrate the improved efficiency of a custom instruction compared to the software
implementation.
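The measurement in step 3 can be sketched as follows. This is a host-side illustration using `std::chrono`, with hypothetical function names; on the target board the elapsed time would more likely be read from a hardware timer, and the thesis does not state which timing facility was used.

```cpp
#include <chrono>
#include <cstddef>

// Processing time of one call to `transform`, in seconds.
template <typename F>
double measure_seconds(F&& transform) {
    const auto t0 = std::chrono::steady_clock::now();
    transform();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Throughput in samples per second for a transform of `samples` points.
double throughput(std::size_t samples, double seconds) {
    return static_cast<double>(samples) / seconds;
}

// Speed-up of the custom-instruction version over the software version.
double speedup(double software_seconds, double hardware_seconds) {
    return software_seconds / hardware_seconds;
}
```

A run that takes 2.0 s in software and 0.5 s with the custom instruction would, for example, be reported as a speed-up of 4.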

4.5 HARDWARE IMPLEMENTATION

In Hardware implementation, Custom Instruction is added in NIOS II Processor. The SOPC GUI

supports the inclusion of custom instructions.

Figure 9- Custom instruction block in SOPC GUI.

Figure 10- Inclusion of Custom Instruction in NIOS II Processor

Figure 10 and Figure 11 show the addition of the custom instruction to the NIOS II processor.
The custom instruction is added to the processor as floating-point hardware while all other
peripherals are kept the same, which increases the amount of hardware used. The custom
instruction maps the memory location from the SRAM interface in SOPC Builder.

Figure 11- RTL view of Custom Instruction

4.5.1 Implementation of Custom Instruction Hardware and Software

Multicycle custom instructions complete in either a fixed or a variable number of clock
cycles. For a custom instruction that completes in a fixed number of clock cycles, you
specify the required number of clock cycles at system generation. For a custom instruction
that requires a variable number of clock cycles, you instantiate the start and done ports.
These ports participate in a handshaking scheme to determine when the custom instruction
execution is complete. The execution of this instruction is shown in the following custom
instruction hardware port timing diagram.

Figure 12- Multicycle Custom Instruction Timing Diagram

The processor asserts the active-high start port on the first clock cycle of the custom
instruction execution. At this time, the dataa and datab ports have valid values, and they
remain valid throughout the duration of the custom instruction execution. The start signal
is asserted for a single clock cycle.

For a fixed-length multicycle custom instruction, after the instruction starts, the
processor waits the specified number of clock cycles and then reads the value on the result
signal. For an n-cycle operation, the custom logic block must present valid data on the nth
rising edge after the custom instruction begins execution. For a variable-length multicycle
custom instruction, the processor waits until the active-high done signal is asserted. The
processor reads the result port on the same clock edge on which done is asserted. The custom
logic block must present data on the result port on the same clock cycle on which it asserts
the done signal.
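The variable-length handshake can be made concrete with a small cycle-level model. The sketch below is written in C for illustration only and is not Altera code or the thesis hardware: the ci_block type, the function names, the multiply operation and the latency parameter are all hypothetical. Each call to ci_clock stands for one rising clock edge.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a variable-length custom logic block. */
typedef struct {
    uint32_t a, b;       /* operands latched when start is seen */
    unsigned remaining;  /* clock edges left before completion */
    bool busy;           /* an operation is in flight */
    bool done;           /* active-high done port */
    uint32_t result;     /* result port, meaningful only while done is high */
} ci_block;

/* One rising clock edge of the custom logic block. */
void ci_clock(ci_block *ci, bool start, uint32_t dataa, uint32_t datab,
              unsigned latency)
{
    ci->done = false;                 /* done is a single-cycle pulse */
    if (start) {
        ci->a = dataa;
        ci->b = datab;
        ci->remaining = latency ? latency : 1;  /* at least one edge */
        ci->busy = true;
    }
    if (ci->busy && --ci->remaining == 0) {
        ci->result = ci->a * ci->b;   /* stand-in operation for the sketch */
        ci->done = true;              /* result and done appear together */
        ci->busy = false;
    }
}

/* Processor side: pulse start for one cycle, hold the operands, wait for done.
   Returns the number of clock edges the instruction took. */
unsigned ci_execute(ci_block *ci, uint32_t dataa, uint32_t datab,
                    unsigned latency, uint32_t *result)
{
    unsigned cycles = 0;
    bool start = true;
    do {
        ci_clock(ci, start, dataa, datab, latency);
        start = false;                /* start is high for a single cycle only */
        cycles++;
    } while (!ci->done);
    *result = ci->result;             /* read on the edge where done is high */
    return cycles;
}
```

In this model, an operation with a latency of 3 returns after three edges and a latency of 1 returns after one, which mirrors how the processor stalls only for as long as the custom logic needs.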

The Nios II system clock feeds the custom logic block’s clk port, and the Nios II system’s
master reset feeds the active-high reset port. The reset port is asserted only when the
whole Nios II system is reset. The custom logic block must treat the active-high clk_en port
as a conventional clock qualifier signal, ignoring clk while clk_en is deasserted.

The Nios II custom instruction software interface is simple and abstracts the details of the
custom instruction from the software developer. For each custom instruction, the Nios II
Embedded Design Suite (EDS) generates a macro in the system header file, system.h. You can
use the macro directly in your C or C++ application code, and you do not need to write
assembly code to access custom instructions. Software can also invoke custom instructions in
Nios II processor assembly language. In this design, the custom instruction is added to the
Nios II processor as floating-point hardware.
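As an illustration of the software interface, the sketch below wraps a floating-point multiply in a macro and uses it in ordinary C code. It is a minimal example under stated assumptions: FP_MUL and FP_MUL_CI_N are hypothetical names, the selector index 252 is assumed and must be taken from the generated system.h in a real project, and the host fallback exists only so the sketch compiles off-target. The __builtin_custom_fnff built-in is the GCC Nios II form for a custom instruction taking two floats and returning a float.

```c
#ifdef __NIOS2__
/* Assumed selector index; the real value comes from the generated system.h. */
#define FP_MUL_CI_N 252
#define FP_MUL(a, b) __builtin_custom_fnff(FP_MUL_CI_N, (a), (b))
#else
/* Host fallback (assumption for this sketch): plain software multiply. */
#define FP_MUL(a, b) ((a) * (b))
#endif

/* Real part of the complex product (ar + j*ai) * (br + j*bi),
   the kind of inner-loop arithmetic an FFT performs. */
float complex_mul_real(float ar, float ai, float br, float bi)
{
    return FP_MUL(ar, br) - FP_MUL(ai, bi);
}
```

The calling code is identical in both builds; only the macro definition decides whether the multiply runs in the custom hardware or in software.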

4.6 Results of FFT algorithm using Custom Instruction

The system is generated in SOPC Builder. In this system, the custom instruction is added as
floating-point hardware.

Figure 13- System with Custom Instruction in SOPC Builder

Figure 14- System Generation for Custom Instruction

Figure 13 and Figure 14 show the system contents and the generation of the system in SOPC
Builder. After the system is generated in SOPC Builder, pin assignment and compilation are
done in Quartus II.

Figure 15- NIOS II System with custom instruction, Block Diagram File view

In Quartus II, the pin assignment is done by importing the pin assignment file for the
CYCLONE II (EP2C35F672C6) FPGA. In Figure 6.6, the address lines from the SRAM memory are
assigned to the custom instruction.

Figure 16- RTL view of NIOS II System

After the system compiles successfully, the hardware is generated for the CYCLONE II FPGA
and a time-limited file is generated, as shown in Figure 17.

Figure 17- Hardware generation of NIOS II System in CYCLONE II

After hardware generation of the system for the CYCLONE II FPGA, the algorithm is
implemented in the Nios II IDE. The same steps stated in chapter 5 are followed, with the
addition of the custom instruction files to the Nios II IDE project.

After generating the hello_world.c file, the custom instruction files floating_point.c,
floating_point.h, floating_point_CI.c and floating_point_SW.c are added to the project. The
project is then built with the Build Project command. While the project is built, the system
header file system.h is generated; its macros connect the C code of the algorithm to the
custom instruction hardware.

After the project builds successfully, the algorithm is run on the CYCLONE II
(EP2C35F672C6) FPGA with the Run As NIOS II Hardware command. After this command executes,
the result is shown in the Nios II console window.

Figure 18- FFT result without Custom Instruction

Figure 18 shows the output of the FFT algorithm running on the FPGA without the custom
instruction, together with the clock cycles and time required for execution.

Figure 19- FFT result with Custom Instruction

Figure 19 shows the output of the algorithm running on the FPGA with the custom instruction,
together with the clock cycles and time required for execution.

Table no. 3 shows the clock cycles required for execution with and without the custom
instruction.

Type                          Clock Cycles    Time Required (s)
Without custom instruction    7212992         0.14426
With custom instruction       5937794         0.1186

Table no. 3 Clock Cycles for execution of Algorithm
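The improvement implied by Table no. 3 can be worked out from the two cycle counts. The sketch below, with hypothetical function names, computes the percentage reduction in cycles and the corresponding speedup factor; it is an illustration of the arithmetic rather than part of the measured program.

```c
#include <stdint.h>

/* Percentage reduction in cycles when going from `base` to `accel`. */
double cycle_reduction_percent(uint64_t base, uint64_t accel)
{
    if (base == 0) {
        return 0.0;
    }
    return 100.0 * ((double)base - (double)accel) / (double)base;
}

/* Speedup factor: how many times faster the accelerated version runs. */
double speedup_factor(uint64_t base, uint64_t accel)
{
    if (accel == 0) {
        return 0.0;
    }
    return (double)base / (double)accel;
}
```

With the counts from the table (7212992 and 5937794 cycles), these functions give a cycle reduction of roughly 17.7% and a speedup factor of roughly 1.21.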

5.1 RESULT

The systems generated in SOPC Builder with and without the custom instruction are compiled
in the Quartus II environment. The FFT implementations are compared in terms of speed, i.e.
the number of clock cycles required, and area, i.e. the number of logic elements (LEs) used
on the FPGA.

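Type                          Clock Cycles    Time Required (s)
Without custom instruction    7212992         0.14426
With custom instruction       5937794         0.1186

Table 4 CPU Clock cycles and time required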

The performance analysis shows that the clock cycles and time required to execute the FFT
algorithm in software, i.e. on the system without the custom instruction, are greater than
those required to execute it in hardware, i.e. on the system with the custom instruction.

In the hardware implementation, the custom instruction maps software operations such as
addition, subtraction, multiplication and division directly to hardware attached to the ALU
of the Nios II processor. Including this hardware reduces the clock cycles and time required
to execute the algorithm.
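To show where these operations occur in the FFT, the sketch below implements a single radix-2 decimation-in-time butterfly on single-precision data. It is an illustrative example with hypothetical names (cplx, butterfly), not the thesis source. Assuming the compiler is configured to use the floating-point custom instruction, each float multiplication, addition and subtraction here is a candidate for execution in hardware instead of in a software routine.

```c
/* Single-precision complex value. */
typedef struct {
    float re;
    float im;
} cplx;

/* One radix-2 decimation-in-time butterfly, computed in place:
   a' = a + w*b and b' = a - w*b, where w is the twiddle factor. */
void butterfly(cplx *a, cplx *b, cplx w)
{
    /* Complex product t = w * b: 4 multiplications, 1 subtraction, 1 addition. */
    float tr = w.re * b->re - w.im * b->im;
    float ti = w.re * b->im + w.im * b->re;

    /* Outputs: 2 additions and 2 subtractions. */
    cplx top = { a->re + tr, a->im + ti };
    cplx bot = { a->re - tr, a->im - ti };

    *a = top;
    *b = bot;
}
```

An N-point radix-2 FFT performs (N/2)·log2(N) such butterflies, so any saving per floating-point operation is multiplied across the whole transform.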

The system generated using SOPC Builder is compiled in the Quartus II software. The hardware
required by the system is measured by the LEs used in the CYCLONE II (EP2C35F672C6) FPGA.
The change in hardware is compared in Table 5.

Items                                  Total Count   Without Custom Instruction   With Custom Instruction
Total Logic Elements                   32216         (10%)                        (14%)
Total Combinational Functions          32216         (9%)                         (13%)
Dedicated Logic Registers              32216         (6%)                         (9%)
Total Pins                             475           (9%)                         (15%)
Total Memory Bits                      473840        (10%)                        (10%)
Embedded multipliers 9-bit Elements    70            (6%)                         (16%)

Table 5 comparison of compilation report

The above table compares the software and hardware systems, i.e. the systems without and
with the custom instruction. It shows that including the custom instruction increases the
hardware used, in exchange for a better result in terms of the clock cycles and time
required to execute the algorithm.

5.2 CONCLUSION

The FFT algorithm, considered as a case study, is implemented using the hardware/software
co-design methodology, which gives an optimized design of the algorithm. The algorithm is
implemented on a CYCLONE II FPGA built around the Nios II processor. With the custom
instruction, the clock cycles required fall from 7212992 to 5937794 (Table no. 3), a
reduction of about 17.7%. The conclusions of the entire experiment and the project are
presented in this chapter. Recommendations for enhancing the precision and performance of
the FFT embedded system are also included in this chapter. The recommendations include speed
and logic cell requirements.

5.3 Concluding Remarks

This thesis demonstrates the design of an embedded system and the FPGA implementation of
the Fast Fourier Transform algorithm. The algorithm used is radix-2 decimation-in-time on
32-bit floating-point data. The FFT embedded system includes the floating-point custom
instruction as an alternative for floating-point arithmetic operations. The floating-point
custom instruction gives the system better performance and speed in floating-point
operations, as shown in the results. Moreover, this thesis introduces a new technique for
providing any kind of data to the FPGA development platform from a host PC through a GUI
built with DEV-C, which makes the connection simpler and more practical. Finally, the
experiments so far have demonstrated promising results, indicating that floating-point
custom instructions can result in large improvements in performance, energy, and timing,
while significantly reducing design turnaround time.

FUTURE SCOPE

The designed FFT embedded system can be improved in several ways. For example, a higher
N-point FFT computation can be introduced, and decimation-in-frequency and higher-radix
architectures can be used to make the design more robust.

Because the selected processor is a soft-core processor, its hardware can be changed
according to the application. The area required for the FFT algorithm can be optimized by
designing optimization approaches for the various blocks of the algorithm.

Here, the FFT algorithm is accelerated using a custom instruction with the Nios II
processor; in future, acceleration can also be done using the C2H compiler tool.

This project could be extended in the field of communication, where the ever-increasing
demand on signal processing capabilities has raised the importance of the Fourier
transform. The Discrete Fourier Transform (DFT) is important in various digital signal
processing applications such as linear filtering, correlation analysis, and spectrum
analysis. The system offers two advantages: the Cyclone II board operates at up to 250MHz
when an external device is connected, and the FPGA (EP2C35F672C6) has about 35536 logic
elements, which enables the design of much more highly integrated circuits.
