ECE 545 Project 1 Introduction & Specification. Schedule Project 1 RTL design for FPGAs (30 points) Due date: Tuesday, November 21, midnight Final choice.

ECE 545 Project 1: Introduction & Specification

Schedule

Project 1 RTL design for FPGAs (30 points)

Due date: Tuesday, November 21, midnight

Final choice of the project topic: Thursday, October 19

Progress reports: Thursday-Friday, November 2-3, and Thursday-Friday, November 16-17

Groups

• ONE-person and TWO-person teams allowed

• Teams must be formed at the moment when the project topic is selected, i.e., by Thursday, October 19

• TWO-person teams work on more complex versions of each project topic

• One final grade per entire team

Honor Code Rules

• Using somebody else's code and presenting it as your own is a serious Honor Code violation and may result in an F grade for the entire course.

• All student teams are expected to write and debug their code by themselves and are not allowed to share their code with other teams.

• Students are encouraged to help and support each other in all problems related to the
  – basic understanding of the problem
  – operation of the CAD tools.

Project 1 - Platform & tools

Target devices: Xilinx FPGA Spartan 3 family

Tools:

VHDL Simulation: Aldec Active HDL or ModelSim
VHDL Synthesis: Synplify Pro or Xilinx XST
Implementation: Xilinx ISE or Xilinx WebPack

Project 1 - Final Deliverables

1. All block diagrams and ASM charts describing the entire circuit and its components (electronic form, PDF)

2. All synthesizable VHDL source codes

3. All testbenches used to verify the operation of the entire circuit and its components, and the corresponding input files containing test vectors, and output files containing results

4. Timing waveforms demonstrating the correct operation of the entire circuit and its components

5. Final report

Final Report (1)

1. Short description of the block diagrams and ASM charts. Discussion of any alternative architectures and solutions.

2. List of source codes and a short description of major modules.

3. Source of test vectors and a way of generating these test vectors.

4. Format of input & output files. Short description of a testbench.

Final Report (2)

5. Results
  • resource utilization (CLB slices, LUTs, FFs, BRAMs, etc.)
  • post-synthesis timing
    – clock frequency
    – throughput
    – latency
    – critical path
  • post placing & routing timing
    – clock frequency
    – throughput
    – latency
    – critical path

Final Report (3)

6. Discussion of the obtained results and any optimizations applied in order to obtain the optimum design.

7. Speed-up vs. software implementation.

8. Discussion of dependence of results on parameters of the application.

9. Deviations from the original specification, encountered problems, and unresolved issues.

Two topics from two different areas to choose from

Cryptography: Stream cipher qualified to Phase 2 of the eSTREAM contest

Digital Signal Processing: Finite Impulse Response Filter

Stream cipher qualified to Phase 2 of the eSTREAM contest

[Figure: Cipher block with an m-bit Message / Ciphertext input, an m-bit Ciphertext / Message output, a k-bit Cryptographic Key input, and a 1-bit Encrypt/Decrypt control input.]

Secret-Key Ciphers

[Figure: Alice and Bob communicate over a Network. Both hold the same key of Alice and Bob, KAB; Alice performs Encryption and Bob performs Decryption.]

Block vs. stream ciphers

Block cipher (key K):
M1, M2, …, Mn → C1, C2, …, Cn
C_i = f_K(M_i)
Every block of ciphertext is a function of only one corresponding block of plaintext.

Stream cipher (key K, internal memory):
m1, m2, …, mn → c1, c2, …, cn
c_i = f_K(m_i, m_{i-1}, …, m_2, m_1)
Every block of ciphertext is a function of the current and all preceding blocks of plaintext.

Typical stream cipher

[Figure: Sender and Receiver each contain a Pseudorandom Key Generator initialized with the same key and initialization vector (seed). At the sender, plaintext m_i is combined with keystream k_i to give ciphertext c_i; at the receiver, ciphertext c_i is combined with the same keystream k_i to recover plaintext m_i.]
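The sender/receiver symmetry in the figure can be sketched in a few lines. This is only an illustration under stated assumptions: the combining operation is taken to be bitwise XOR (the usual choice for this kind of stream cipher), and the keystream generator is a toy built on Python's `random.Random`, which is NOT cryptographically secure and is not one of the eSTREAM candidates. The function names are hypothetical.

```python
import random


def keystream(key: bytes, iv: bytes, n: int) -> bytes:
    """Toy pseudorandom key generator (assumption: NOT secure, illustration only).

    Both sides seed it with the same key and initialization vector,
    so both sides obtain the same keystream bytes k_i.
    """
    rng = random.Random(key + iv)
    return bytes(rng.getrandbits(8) for _ in range(n))


def xor_with_keystream(data: bytes, key: bytes, iv: bytes) -> bytes:
    """Compute c_i = m_i XOR k_i (or m_i = c_i XOR k_i); the operation is its own inverse."""
    ks = keystream(key, iv, len(data))
    return bytes(d ^ k for d, k in zip(data, ks))


if __name__ == "__main__":
    key = bytes(range(10))   # 80-bit key, as in PROFILE 2
    iv = bytes(range(8))     # 64-bit initialization vector
    message = b"ECE 545 project"
    ciphertext = xor_with_keystream(message, key, iv)
    recovered = xor_with_keystream(ciphertext, key, iv)
    print(ciphertext.hex())
    print(recovered)
```

Because XOR is its own inverse, the same routine serves as both the sender and the receiver, which is why the two halves of the figure are mirror images.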

eSTREAM - Contest for a new stream cipher standard, 2004-2008

PROFILE 1

• Stream cipher suitable for software implementations optimized for high speed
• Key size - 128 bits
• Initialization vector - 64 bits or 128 bits

PROFILE 2

• Stream cipher suitable for hardware implementations with limited memory, number of gates, or power supply
• Key size - 80 bits
• Initialization vector - 32 bits or 64 bits

Schedule of the contest

November 2004: Request for proposals

29 April 2005: Deadline for submissions (34 ciphers; 23 candidates for PROFILE 1, 26 candidates for PROFILE 2)

26-27 May 2005: Stream Cipher Workshop, Denmark

March 2006: End of Phase I

July 2006: Beginning of the evaluation part of Phase II

September 2007: End of Phase II

January 2008: Final report

eSTREAM - Contest for a new stream cipher standard, 2004-2008

http://www.ecrypt.eu.org/stream/timetable.html

10 focus candidates

PROFILE 1 (Software)
• Dragon - Ed Dawson, Kevin Chen, Matt Henricksen, William Millan, Leonie Simpson, HoonJae Lee, SangJae Moon
• HC-256 - Hongjun Wu
• LEX - Alex Biryukov
• Phelix - Doug Whiting, Bruce Schneier, Stefan Lucks, Frédéric Muller
• Py - Eli Biham and Jennifer Seberry
• Salsa20 - Daniel Bernstein
• SOSEMANUK - Come Berbain, Olivier Billet, Anne Canteaut, Nicolas Courtois, Henri Gilbert, Louis Goubin, Aline Gouget, Louis Granboulan, Cédric Lauradoux, Marine Minier, Thomas Pornin, Hervé Sibert

PROFILE 2 (Hardware)
• Grain - Martin Hell, Thomas Johansson and Willi Meier
• MICKEY-128 - Steve Babbage and Matthew Dodd
• Phelix - Doug Whiting, Bruce Schneier, Stefan Lucks, Frédéric Muller
• Trivium - Christophe De Cannière and Bart Preneel

Your task

For groups of the size ONE: implement ONE out of the following FIVE ciphers

For groups of the size TWO: implement TWO out of the following FIVE ciphers

Grain, MICKEY-128, Phelix, Salsa20, Trivium

Optimization Criteria

I. Minimum area

II. Maximum ratio: Throughput divided by Total Circuit Area [CLB slices]

Required interface

[Figure: eSTREAM cipher block with the following ports: clk, reset, enc_dec, data_in (d bits), data_in_ready, data_in_write, data_out (d bits), write, full, key_IV (k bits), key_IV_ready, key_IV_write.]

k = 1, 2, 4, 8, 16, 32, 64

d: set of allowed values specific to a given algorithm

Tasks of a TWO-person team

• Implement TWO ciphers

• Compare TWO ciphers against each other

eSTREAM Implementation Hints

Example of an eSTREAM cipher

Linear Feedback Shift Register (LFSR)

An LFSR is denoted <L, C(D)>, where L is the length and C(D) is the connection polynomial:

C(D) = 1 + c_1 D + c_2 D^2 + ... + c_L D^L

Example of LFSR: <4, 1 + D + D^4> (length 4, connection polynomial 1 + D + D^4)

Initial state: [s_{L-1}, s_{L-2}, ..., s_1, s_0]

LFSR recursion:

s_j = c_1 s_{j-1} ⊕ c_2 s_{j-2} ⊕ ... ⊕ c_{L-1} s_{j-(L-1)} ⊕ c_L s_{j-L}, for j ≥ L

LFSR State Sequence
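The LFSR recursion can be simulated directly in software to produce an output sequence for comparison against an RTL simulation. The sketch below is a behavioral model, not a hardware description; the function name `lfsr_sequence` and the chosen initial state are illustrative assumptions. It uses the length-4 example with connection polynomial 1 + D + D^4, so c_1 = c_4 = 1 and s_j = s_{j-1} XOR s_{j-4}.

```python
def lfsr_sequence(taps, init, n):
    """Return the first n output bits s_0 .. s_{n-1} of an LFSR.

    taps: [c_1, c_2, ..., c_L], the connection polynomial coefficients
    init: [s_0, s_1, ..., s_{L-1}], the initial state (oldest bit first)
    """
    length = len(taps)
    if len(init) != length:
        raise ValueError("initial state must have exactly L bits")
    s = list(init)
    while len(s) < n:
        j = len(s)
        bit = 0
        # s_j = c_1*s_{j-1} XOR c_2*s_{j-2} XOR ... XOR c_L*s_{j-L}
        for i in range(1, length + 1):
            bit ^= taps[i - 1] & s[j - i]
        s.append(bit)
    return s[:n]


if __name__ == "__main__":
    # <4, 1 + D + D^4>: c_1 = 1, c_2 = 0, c_3 = 0, c_4 = 1
    taps = [1, 0, 0, 1]
    init = [0, 1, 1, 0]  # s_0 = 0, s_1 = 1, s_2 = 1, s_3 = 0
    seq = lfsr_sequence(taps, init, 30)
    print("".join(str(b) for b in seq))
```

With a nonzero initial state this particular LFSR repeats every 15 bits, the maximum possible period 2^4 - 1 for length 4.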

Non-linear Feedback Shift Register (NFSR)

Doubling the speed of Grain

Resources

eSTREAM PHASE 2 –the ECRYPT Stream Cipher Project

available at

http://www.ecrypt.eu.org/stream/

Source of test vectors

Reference C implementations provided by the authors of the algorithms.




Finite Impulse Response Filter

Topic proposed and co-advised by:

Dr. David Hwang Dr. Kathleen Wage

DSP Project: FIR Digital Filter Design

• Digital filters are widely used in digital communications and audio/video processing.

• In particular, finite impulse response (FIR) filters are used for their ease of implementation and stability.

• In this project, you will investigate different FIR filter structures and their VLSI implementations
  – Step 1: Implement and compare direct form versus direct form transposed structures
  – Step 2: Implement and compare fast FIR structures which reduce the number of required multiplications per sample

Example: Gigabit Ethernet Transceiver

• As seen above, digital filters (boxed in blue) play a crucial role in digital communication chips such as Ethernet transceivers, cable modems, DSL modems, satellite receivers, mobile phones, etc.

Step 1a: Direct Form FIR Filter

[Figure: direct form FIR filter. The input x(n) passes through a chain of Z^-1 delays; the delayed samples are multiplied by h0, h1, h2, ..., hN-1 and summed to produce y(n).]

• An FIR filter implements a convolution in the time domain
• Critical path of N-tap filter:
  – N-1 adds + 1 multiply
• Arithmetic complexity of N-tap filter modeled as:
  – N multiplications/sample + N-1 adds/sample
• Problem 1a: Design a parametrizable direct form FIR filter
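A behavioral reference model of the direct form structure can help when checking RTL results. The sketch below is an assumption-laden illustration rather than the required design: it is written in Python with exact integer arithmetic (no rounding to K fractional bits), and the function name `fir_direct` is hypothetical. It computes the convolution y(n) = h0 x(n) + h1 x(n-1) + ... + hN-1 x(n-N+1) using a delay line, as in the figure.

```python
def fir_direct(h, x):
    """Direct form FIR filter: a delay line of inputs feeding N multipliers
    whose outputs are summed by a chain of N-1 adders."""
    n_taps = len(h)
    delay = [0] * n_taps  # delay[k] holds x(n-k); registers start at zero
    y = []
    for sample in x:
        # Shift the new sample into the delay line (the Z^-1 chain).
        delay = [sample] + delay[:-1]
        # N multiplications and N-1 additions per output sample.
        y.append(sum(hk * dk for hk, dk in zip(h, delay)))
    return y


if __name__ == "__main__":
    h = [1, 2, 3]
    print(fir_direct(h, [1, 0, 0, 0]))  # impulse response reproduces h
    print(fir_direct(h, [1, 1, 1, 1]))  # step response
```

Feeding a unit impulse returns the coefficients in order, which is a convenient first sanity check for a testbench.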

Step 1b: Direct Form Transpose FIR Filter

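• Use a signal flow graph reversal to reduce the critical path: the transpose structure

• Critical path of N-tap transposed filter:
  – 1 add + 1 multiply

• Arithmetic complexity of N-tap filter modeled as:
  – N multiplications/sample + N-1 adds/sample

• Problem 1b: Design a parametrizable direct form transpose FIR filter

[Figure: direct form transpose FIR filter. The input x(n) is multiplied by hN-1, hN-2, hN-3, ..., h0 in parallel; the products are accumulated through a chain of adders separated by Z^-1 delays to produce y(n).]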
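The transposed structure can be modeled the same way. In the sketch below (a Python behavioral model with exact integer arithmetic; the name `fir_transposed` and the register layout are illustrative assumptions), every input sample is multiplied by all coefficients at once, and each register stores a partial sum, so only one multiply and one add sit between any two registers.

```python
def fir_transposed(h, x):
    """Transposed direct form FIR filter.

    r[k] is the register between adder k and adder k+1. Each cycle:
      y(n)   = h[0]*x(n) + r[0]
      r[k]   = h[k+1]*x(n) + r[k+1]   (last register gets only the product)
    """
    n_taps = len(h)
    r = [0] * (n_taps - 1)  # N-1 registers, cleared at reset
    y = []
    for sample in x:
        products = [hk * sample for hk in h]
        y.append(products[0] + (r[0] if r else 0))
        # Register update: every new value depends on 1 multiply + 1 add.
        r = [
            products[k + 1] + (r[k + 1] if k + 1 < n_taps - 1 else 0)
            for k in range(n_taps - 1)
        ]
    return y


if __name__ == "__main__":
    h = [1, 2, 3]
    print(fir_transposed(h, [1, 0, 0, 0]))
    print(fir_transposed(h, [1, 1, 1, 1]))
```

The outputs match the direct form sample for sample; only the position of the registers, and therefore the critical path, changes.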

Step 2: Power Reduction via Parallel Subexpression Sharing

[Figure: 2-parallel fast FIR structure. Inputs x(2n) and x(2n+1) feed three N/2-tap subfilters H0(z), H0(z)+H1(z), and H1(z); adders and a Z^-1 delay combine the subfilter outputs into y(2n) and y(2n+1).]

• Direct form and transpose form structures (running at the same rate) require N multiplications/sample and N-1 adds/sample

• Methods exist to reduce this complexity by parallel processing and subexpression sharing. See [1] and [2] for details and derivation.
  – In the 2-parallel structure above, two inputs arrive at half the original clock rate and are processed in parallel by three ceil(N/2)-tap filters [ceil() is the ceiling function]
  – Arithmetic complexity of the 2-parallel filter is approximately:
    3 x N/2 multiplications / two samples + 3 x (N/2-1) adds / two samples + 4 adds / two samples
    = 3/4 N multiplications/sample + (3N/4 + 1/2) adds/sample
  – If power is dominated by multipliers, 25% power savings over traditional structures!

• Problem 2a: Design a 2-parallel parametrizable FIR filter

Obtaining Coefficients of 2-Parallel Subfilters

• Example for N = 8
  – H(z) = {h0, h1, h2, h3, h4, h5, h6, h7}

• Subfilter coefficients obtained by performing a polyphase decomposition by 2. Each subfilter has N/2 = 4 coefficients:
  – H0(z) = {h0, h2, h4, h6}
  – H1(z) = {h1, h3, h5, h7}
  – H0(z) + H1(z) = {h0+h1, h2+h3, h4+h5, h6+h7}
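The polyphase decomposition and the three-subfilter structure can be checked against an ordinary convolution in software. The sketch below is a Python behavioral model under stated assumptions: it uses the commonly published 2-parallel fast FIR arrangement, y(2k) = (H0 x0)(k) + (H1 x1)(k-1) and y(2k+1) = ((H0+H1)(x0+x1))(k) - (H0 x0)(k) - (H1 x1)(k), which may differ in detail from the figure; it assumes N and the input length are even; and the names `convolve` and `fir_2parallel` are illustrative.

```python
def convolve(h, x):
    """Causal FIR reference: y[n] = sum_k h[k] * x[n-k], same length as x."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def fir_2parallel(h, x):
    """2-parallel fast FIR using three N/2-tap subfilters."""
    if len(h) % 2 or len(x) % 2:
        raise ValueError("this sketch assumes even N and even input length")
    # Polyphase decomposition by 2 of coefficients and inputs.
    h0, h1 = h[0::2], h[1::2]
    x0, x1 = x[0::2], x[1::2]
    h01 = [a + b for a, b in zip(h0, h1)]   # precomputed offline
    x01 = [a + b for a, b in zip(x0, x1)]   # 1 pre-processing add per block
    u = convolve(h0, x0)     # H0 * x(2k)
    v = convolve(h1, x1)     # H1 * x(2k+1)
    w = convolve(h01, x01)   # (H0 + H1) * (x(2k) + x(2k+1))
    y = []
    for k in range(len(x0)):
        v_delayed = v[k - 1] if k > 0 else 0   # the Z^-1 at block rate
        y.append(u[k] + v_delayed)             # y(2k):   1 post-add
        y.append(w[k] - u[k] - v[k])           # y(2k+1): 2 post-adds
    return y


if __name__ == "__main__":
    h = [1, 2, 3, 4, 5, 6, 7, 8]     # N = 8
    x = list(range(1, 13))
    print(fir_2parallel(h, x) == convolve(h, x))
```

The one pre-processing add and three post-processing adds per block account for the "4 adds / two samples" term in the complexity estimate.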

3-parallel filter

[Figure: 3-parallel fast FIR structure. Inputs x(3n), x(3n+1), and x(3n+2) feed six N/3-tap subfilters H0(z), H1(z), H2(z), H0(z)+H1(z), H1(z)+H2(z), and H0(z)+H1(z)+H2(z); adders and two Z^-1 delays combine the subfilter outputs into y(3n), y(3n+1), and y(3n+2).]

• In the 3-parallel filter, three inputs arriving at a third of the original rate are processed by six parallel ceil(N/3)-tap filters

• Arithmetic complexity of the 3-parallel filter is approximately:
  – 2/3 N multiplications/sample + (2/3 N + 4/3) adds/sample
  – 33% reduction in multiplications/sample

• Problem 2b: Design a 3-parallel parametrizable FIR filter

Obtaining Coefficients of 3-Parallel Subfilters

• Example for N = 9
  – H(z) = {h0, h1, h2, h3, h4, h5, h6, h7, h8}

• Subfilter coefficients obtained by performing a polyphase decomposition by 3. Each subfilter has N/3 = 3 coefficients:
  – H0(z) = {h0, h3, h6}
  – H1(z) = {h1, h4, h7}
  – H2(z) = {h2, h5, h8}
  – H0(z) + H1(z) = {h0+h1, h3+h4, h6+h7}
  – H1(z) + H2(z) = {h1+h2, h4+h5, h7+h8}
  – H0(z) + H1(z) + H2(z) = {h0+h1+h2, h3+h4+h5, h6+h7+h8}
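The six-subfilter arrangement can likewise be verified in software. The sketch below is a Python behavioral model under stated assumptions: it follows the commonly published 3-parallel fast FIR algorithm (as derived in texts such as reference [2]), which may differ in detail from the figure; it assumes N and the input length are multiples of 3; and all function names are illustrative.

```python
def convolve(h, x):
    """Causal FIR reference: y[n] = sum_k h[k] * x[n-k], same length as x."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def add(a, b):
    return [p + q for p, q in zip(a, b)]


def sub(a, b):
    return [p - q for p, q in zip(a, b)]


def delay(s):
    """One block-rate delay (Z^-1), register cleared at reset."""
    return [0] + s[:-1]


def fir_3parallel(h, x):
    """3-parallel fast FIR using six N/3-tap subfilters."""
    if len(h) % 3 or len(x) % 3:
        raise ValueError("this sketch assumes N and input length divisible by 3")
    h0, h1, h2 = h[0::3], h[1::3], h[2::3]
    x0, x1, x2 = x[0::3], x[1::3], x[2::3]
    x01, x12 = add(x0, x1), add(x1, x2)          # 2 pre-adds
    x012 = add(x01, x2)                          # 1 pre-add
    p0, p1, p2 = convolve(h0, x0), convolve(h1, x1), convolve(h2, x2)
    p01 = convolve(add(h0, h1), x01)             # coefficient sums are offline
    p12 = convolve(add(h1, h2), x12)
    p012 = convolve(add(add(h0, h1), h2), x012)
    a = sub(p01, p1)          # H0x0 + H0x1 + H1x0
    b = sub(p12, p1)          # H1x2 + H2x1 + H2x2
    c = sub(p0, delay(p2))    # H0x0 - Z^-1 H2x2
    y0 = add(c, delay(b))     # y(3k)
    y1 = sub(a, c)            # y(3k+1)
    y2 = sub(sub(p012, a), b) # y(3k+2)
    out = []
    for k in range(len(x0)):
        out += [y0[k], y1[k], y2[k]]
    return out


if __name__ == "__main__":
    h = list(range(1, 10))   # N = 9
    x = list(range(1, 13))
    print(fir_3parallel(h, x) == convolve(h, x))
```

This arrangement uses 3 pre-processing and 7 post-processing adds per block of three samples, which together with the six (N/3 - 1)-adder subfilters gives the (2/3 N + 4/3) adds/sample figure.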

Further parallelism

• These parallel structures introduce issues such as increased area, adder overhead (pre- and post-processing), etc. which eventually become prohibitive as the subsampling rate increases

Assumptions

All coefficients are loaded to the circuit before the start of processing and do not change during the runtime.

Registers storing coefficients are connected in a chain, so coefficients must be loaded serially, in the proper order, starting from the ones with the smallest indices.

Parameters of the design

N: number of taps (N=8, 12, 16, 24, 32)

M: fractional wordlength of input (M=8..10)

K: fractional wordlength of output (K=8..10)

L: fractional wordlength of coefficients (L=7-11)

Required interface - basic architecture

[Figure: FIR Filter block with the following ports:
  clk
  reset_datapath
  reset_coeff
  d_in (1.M)
  d_out (1.K)
  coeff (1.L)
  filt_mode (0=load coefficients, 1=run filter)
  load_begin (0=idle, 1=start to load coefficients)
  load_coeff_done]

Required interface - 2-parallel structure

[Figure: FIR Filter block with the following ports:
  clk
  reset_datapath
  reset_coeff
  d_in_1 (1.M)
  d_in_2 (1.M)
  d_out_1 (1.K)
  d_out_2 (1.K)
  coeff (1.L)
  filt_mode (0=load coefficients, 1=run filter)
  load_begin (0=idle, 1=start to load coefficients)
  load_coeff_done]

One-Person Team Requirements

• Matlab code will be given for five different configurations (A, B, C, D, E), each with different values of N, M, L, and K.
  – CASE A: N = 8, M = 8, K = 8, L = 7
  – CASE B: N = 12, M = 9, K = 9, L = 8
  – CASE C: N = 16, M = 9, K = 10, L = 9
  – CASE D: N = 24, M = 10, K = 11, L = 10
  – CASE E: N = 32, M = 11, K = 12, L = 11

• Step 1: Direct form and transpose form structures:
  – Generate parametrizable VHDL code; round output of each multiplier to K fractional bits
  – Generate test vectors using Matlab and verify the test vectors in RTL for configurations A-E
  – Implement configurations B and D on FPGA
    • Optimize for minimum area
    • Optimize for maximum ratio of: throughput / area (CLB slices)

• Step 2: 2-parallel and 3-parallel fast FIR structures
  – Generate parametrizable VHDL code; round output of each multiplier to K fractional bits
  – Generate test vectors using Matlab and verify the test vectors in RTL for configurations B and D
  – Implement configurations B and D on FPGA


Two-Person Team Additional Requirements

• Step 3: 4-parallel and 6-parallel fast FIR structures. See ref [2] for block diagrams.

– Generate parametrizable VHDL code; round output of each multiplier to K fractional bits

– Generate test vectors using Matlab and verify the test vectors in RTL for configurations B and D

– Implement configurations B and D on FPGA
  • Optimize for minimum area
  • Optimize for maximum ratio of: throughput / area (CLB slices)

• Step 4: Quantization studies

– For the 6-parallel filter and configurations B and D, implement truncation instead of rounding after the multipliers.
  • Optimize for minimum area
  • Optimize for maximum ratio of: throughput / area (CLB slices)

– For the 4-parallel filter and configurations B and D, round to K+4 bits after the multipliers. Round again to K bits right before the filter outputs to produce a 1.K output.


Required reading

[1] Z. Mou and P. Duhamel, “Short-length FIR filters and their use in fast nonrecursive filtering,” IEEE Transactions on Signal Processing, vol. 39, no. 6, pp. 1322-1332, June 1991.

[2] K.K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley, pp. 256-275, 1999.

Source of test vectors

Matlab implementation provided by Dr. Hwang

Important Notes on Two’s Complement Arithmetic

Project Notation

• For this project, we are using two’s complement fractional notation
• An m.M number indicates a two’s complement m+M bit word with m integer bits and M fractional bits
• Example: 1.3 number
  – 0.111 = +0.875
  – 1.000 = -1
  – 1.111 = -0.125
• Example: 2.2 number
  – 00.11 = +0.75
  – 10.00 = -2
  – 01.01 = +1.25
• The dynamic range of an m.M number is [-2^(m-1), 2^(m-1))
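The notation can be checked mechanically. The following Python sketch (the helper names `fixed_to_float` and `dynamic_range` are illustrative assumptions) interprets a bit string written with a binary point as a two's complement m.M number and reproduces the examples above.

```python
def fixed_to_float(bits: str) -> float:
    """Interpret a string such as '10.00' as a two's complement m.M number."""
    int_part, frac_part = bits.split(".")
    raw = int_part + frac_part
    value = int(raw, 2)
    if raw[0] == "1":                 # sign bit set: subtract 2^(m+M)
        value -= 1 << len(raw)
    return value / (1 << len(frac_part))


def dynamic_range(m: int):
    """Return (lowest, upper_bound) of an m.M number: [-2^(m-1), 2^(m-1))."""
    return (-(2 ** (m - 1)), 2 ** (m - 1))


if __name__ == "__main__":
    for word in ["0.111", "1.000", "1.111", "00.11", "10.00", "01.01"]:
        print(word, "=", fixed_to_float(word))
    print(dynamic_range(1), dynamic_range(2))
```

Note that the upper end of the range is exclusive: the largest 1.3 number is 0.111 = +0.875, not +1.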

Two’s Complement Multiplication

• The wordlength required for the product of 1.M x 1.L numbers =
  – 2.(M+L) if we assume -1 x -1 = +1 may occur
  – 1.(M+L) if we assume -1 x -1 = +1 will never occur

• In general, a product of m.M x l.L numbers =
  – (m+l).(M+L) if we assume (most neg value of a) x (most neg value of b) may occur
  – (m+l-1).(M+L) if we assume (most neg value of a) x (most neg value of b) will never occur

• In this project, we assume that (most neg value of a) x (most neg value of b) will never occur for any multiplier in any filter structure. This is guaranteed by scaling the inputs and coefficients properly in Matlab.
  – Examples: 1.5 x 2.5 = 2.10, 1.4 x 1.6 = 1.10, 3.4 x 2.3 = 4.7

[Figure: multiplier with inputs a (1.M) and b (1.L) and output 1.(M+L).]
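The product wordlength rule is simple enough to encode as a helper for sizing signals. This Python sketch (the function name and its flag are illustrative assumptions) returns the format of the product as a pair (integer bits, fractional bits) and reproduces the examples above under the project's assumption.

```python
def product_format(m, M, l, L, most_negative_product_possible=False):
    """Format of the product of an m.M number and an l.L number.

    Under the project's assumption that (most neg) x (most neg) never
    occurs, one integer bit can be saved.
    """
    int_bits = m + l if most_negative_product_possible else m + l - 1
    return (int_bits, M + L)


if __name__ == "__main__":
    print(product_format(1, 5, 2, 5))   # 1.5 x 2.5
    print(product_format(1, 4, 1, 6))   # 1.4 x 1.6
    print(product_format(3, 4, 2, 3))   # 3.4 x 2.3
    # Worst case kept: -1 x -1 = +1 does not fit in 1.(M+L).
    print(product_format(1, 8, 1, 7, most_negative_product_possible=True))
```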

Two’s Complement Truncation versus Rounding

• In this project, we ask you to round the output of each multiplier to K fractional bits.

• To round a k.K’ number to a k.K number (K < K’):
  – Truncate the k.K’ number to become a k.K number
  – Add the former fractional K+1 bit to fractional position K

• For information purposes, to truncate a k.K’ number to a k.K number (K < K’):
  – Truncate the k.K’ number to become a k.K number

• Rounding and truncation produce equal noise variance, whereas rounding is (approximately) unbiased and truncation is biased

Truncation versus Rounding: Example: 2.5 number to a 2.3 number

ROUNDING

00.01110 → 00.011 + 1 = 00.100
11.01000 → 11.010 + 0 = 11.010
10.00110 → 10.001 + 1 = 10.010

TRUNCATION

00.01110 → 00.011
11.01000 → 11.010
10.00110 → 10.001
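Both operations map onto shifts of the signed integer that holds the word. The Python sketch below (helper names are illustrative assumptions) implements truncation as an arithmetic right shift and rounding as the same shift plus the first discarded bit, and reproduces the three 2.5 to 2.3 examples. It does not handle the overflow that rounding can cause for values at the very top of the range.

```python
def from_bits(bits: str) -> int:
    """Two's complement bit string (binary point ignored) to signed integer."""
    raw = bits.replace(".", "")
    value = int(raw, 2)
    if raw[0] == "1":
        value -= 1 << len(raw)
    return value


def to_bits(value: int, int_bits: int, frac_bits: int) -> str:
    """Signed integer to a two's complement string with a binary point."""
    width = int_bits + frac_bits
    raw = format(value & ((1 << width) - 1), f"0{width}b")
    return raw[:int_bits] + "." + raw[int_bits:]


def truncate(value: int, drop: int) -> int:
    """Discard the lowest `drop` fractional bits (arithmetic shift right)."""
    return value >> drop


def round_to(value: int, drop: int) -> int:
    """Truncate, then add the first discarded bit at the new LSB (drop >= 1)."""
    return (value >> drop) + ((value >> (drop - 1)) & 1)


if __name__ == "__main__":
    for word in ["00.01110", "11.01000", "10.00110"]:
        v = from_bits(word)
        print(word,
              "round:", to_bits(round_to(v, 2), 2, 3),
              "truncate:", to_bits(truncate(v, 2), 2, 3))
```

In hardware the same rounding is one adder with the discarded bit wired to the carry input, while truncation costs nothing beyond dropping wires.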

Two’s Complement Addition

• FIR filters perform chains of additions
• A k.K number plus a k.K number requires a (k+1).K number to represent the sum
  – Ex. 0.110 (0.75) + 0.110 (0.75) = 01.100 (1.5)
  – Ex. 1.000 (-1) + 1.000 (-1) = 10.000 (-2)

• In general, an adder chain summing J numbers, each of wordlength k.K, requires a wordlength of (k + ceil(log2(J))).K after the final adder
  – This grows for a large number of coefficients N

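[Figure: direct form filter with wordlength annotations. The input x(n) passes through Z^-1 delays; each 1.M sample is multiplied by a coefficient h0, h1, ... (labeled 1.N in the figure), each product (labeled 1.M+N) passes through a ROUND block to 1.K, and the 1.K values are summed by the adder chain to produce y(n).]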
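The wordlength growth rule for an adder chain can be tabulated for the project's filter sizes. In this Python sketch (the function name `sum_format` is an illustrative assumption), ceil(log2(J)) is computed exactly with integer arithmetic rather than floating point.

```python
def sum_format(k: int, K: int, J: int):
    """Format needed to hold the sum of J numbers of wordlength k.K:
    (k + ceil(log2(J))).K, returned as (integer bits, fractional bits)."""
    if J < 1:
        raise ValueError("J must be at least 1")
    growth = (J - 1).bit_length()   # exact ceil(log2(J)) for J >= 1
    return (k + growth, K)


if __name__ == "__main__":
    for n_taps in (8, 12, 16, 24, 32):
        print(n_taps, "terms of 1.9 ->", sum_format(1, 9, n_taps))
```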

Two’s Complement Adder Chain Trick using Modulo Arithmetic

• Trick: if we know output of adder is bounded within a k’.K value (where k’ is some known value), then all intermediate addition nodes only require k’.K bit wordlengths
  – Provides hardware savings for large number of coefficients N!

• This is only true if we know the output of the adder chain is bounded
  – Be careful, because x(2n) + x(2n+1) is not guaranteed to be bounded in 1.M; you need the full 2.M
  – h(0) + h(1) is not guaranteed to be bounded in 1.L; you need the full 2.L
  – In this project, this trick helps after multiplier outputs, not on multiplier inputs

• In our project, the final output y(n) is bounded within a 1.K bit wordlength. This has been controlled by scaling the inputs and coefficients in Matlab.

• To learn about more helpful hardware “tricks” take ECE 645 next semester!

[Figure: two adder chains producing y(n) from 1.K inputs. In the first, the intermediate wordlengths grow (2.K, 3.K, 3.K); in the second, using the modulo arithmetic trick, every intermediate node stays at 1.K.]
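The reason the trick works is that two's complement addition is arithmetic modulo 2^width, so intermediate overflows cancel as long as the true final sum fits. The Python sketch below (function names and the chosen values are illustrative assumptions) models a 4-bit 1.3 adder chain that wraps at every node and still produces the correct result.

```python
def wrap(value: int, width: int) -> int:
    """Keep only `width` bits and reinterpret them as two's complement."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def modular_adder_chain(terms, width: int) -> int:
    """Sum the terms with every intermediate node limited to `width` bits."""
    acc = 0
    for t in terms:
        acc = wrap(acc + t, width)
    return acc


if __name__ == "__main__":
    # 1.3 words as raw integers: 0.111 = 7 (+0.875), 1.000 = -8 (-1.0)
    terms = [7, 7, -8]
    # The first addition overflows: 7 + 7 = 14 wraps to -2 in 4 bits ...
    print(wrap(7 + 7, 4))
    # ... but the final result 6 (0.110 = +0.75) is still correct,
    # because the true sum fits in the 1.3 range.
    print(modular_adder_chain(terms, 4), sum(terms))
```

If the true final sum did not fit in the chosen wordlength, the wrapped result would be wrong, which is why the bound on y(n) has to be known in advance.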
