ECE 545 Project 1 Introduction & Specification
Jan 16, 2016
Schedule
Project 1 RTL design for FPGAs (30 points)
Due date: Tuesday, November 21, midnight
Final choice of the project topic: Thursday, October 19
Progress reports: Thursday-Friday, November 2-3, and Thursday-Friday, November 16-17
Groups
• ONE-person and TWO-person teams allowed
• Teams must be formed at the moment when the project topic is selected, i.e., by Thursday, October 19
• TWO-person teams work on more complex versions of each project topic
• One final grade per entire team
Honor Code Rules
• Using somebody else’s code and presenting it as your own is a serious Honor Code violation and may result in an F grade for the entire course.
• All student teams are expected to write and debug their code by themselves and are not allowed to share their code with other teams.
• Students are encouraged to help and support each other in all problems related to
– the basic understanding of the problem
– the operation of the CAD tools.
Project 1 - Platform & tools
Target devices: Xilinx FPGA Spartan 3 family
Tools:
VHDL Simulation: Aldec Active HDL or ModelSim
VHDL Synthesis: Synplify Pro or Xilinx XST
Implementation: Xilinx ISE or Xilinx WebPack
Project 1 - Final Deliverables
1. All block diagrams and ASM charts describing the entire circuit and its components (electronic form, PDF)
2. All synthesizable VHDL source codes
3. All testbenches used to verify the operation of the entire circuit and its components, and the corresponding input files containing test vectors, and output files containing results
4. Timing waveforms demonstrating the correct operation of the entire circuit and its components
5. Final report
Final Report (1)
1. Short description of the block diagrams and ASM charts. Discussion of any alternative architectures and solutions.
2. List of source codes and a short description of major modules.
3. Source of test vectors and a way of generating these test vectors.
4. Format of input & output files. Short description of a testbench.
Final Report (2)
5. Results
• resource utilization (CLB slices, LUTs, FFs, BRAMs, etc.)
• post-synthesis timing
– clock frequency
– throughput
– latency
– critical path
• post placing & routing timing
– clock frequency
– throughput
– latency
– critical path
Final Report (3)
6. Discussion of the obtained results and any optimizations applied in order to obtain the optimum design.
7. Speed-up vs. software implementation.
8. Discussion of dependence of results on parameters of the application.
9. Deviations from the original specification, encountered problems, and unresolved issues.
Two topics from two different areas to choose from
Cryptography: Stream cipher qualified to Phase 2 of the eSTREAM contest
Digital Signal Processing: Finite Impulse Response Filter
Stream cipher qualified to Phase 2 of the eSTREAM contest
[Figure: Cipher block with an m-bit Message / Ciphertext input, an m-bit Ciphertext / Message output, a k-bit Cryptographic Key input, and a 1-bit Encrypt/Decrypt control input]
Secret-Key Ciphers
[Figure: Alice (Encryption) and Bob (Decryption) communicate over a Network; both hold the same key of Alice and Bob, KAB]
Block vs. stream ciphers
Block cipher
Key K; plaintext blocks M1, M2, …, Mn; ciphertext blocks C1, C2, …, Cn
$C_i = f_K(M_i)$
Every block of ciphertext is a function of only one corresponding block of plaintext.
Stream cipher (with memory)
Key K; plaintext blocks m1, m2, …, mn; ciphertext blocks c1, c2, …, cn
$c_i = f_K(m_i, m_{i-1}, \dots, m_2, m_1)$
Every block of ciphertext is a function of the current and all preceding blocks of plaintext.
Typical stream cipher
[Figure: Sender and Receiver each run a Pseudorandom Key Generator initialized with the key and the initialization vector (seed). The sender combines plaintext mi with keystream ki to produce ciphertext ci; the receiver combines ciphertext ci with keystream ki to recover plaintext mi.]
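The sender/receiver symmetry of a typical stream cipher can be illustrated with a small sketch. It assumes the combining operation is a bitwise XOR (the usual choice for binary additive stream cipers, though the slide does not name the operation), and it uses Python's `random.Random` purely as a stand-in for the pseudorandom key generator. That generator is not cryptographically secure and is not one of the eSTREAM candidates; the names `keystream` and `xor_stream` and the way the key and IV are packed into a seed are hypothetical.

```python
import random


def keystream(key, iv, n):
    """Return n keystream bytes from a toy generator seeded by (key, iv).

    random.Random is only a placeholder for a real keystream generator.
    """
    rng = random.Random((key << 64) | iv)  # hypothetical seed packing
    return bytes(rng.randrange(256) for _ in range(n))


def xor_stream(data, key, iv):
    """Combine data with the keystream byte by byte using XOR.

    Because XOR is its own inverse, the same function encrypts and decrypts.
    """
    ks = keystream(key, iv, len(data))
    return bytes(d ^ k for d, k in zip(data, ks))


if __name__ == "__main__":
    message = b"ECE 545"
    ciphertext = xor_stream(message, key=0x0123456789ABCDEF, iv=42)
    recovered = xor_stream(ciphertext, key=0x0123456789ABCDEF, iv=42)
    print(recovered == message)
```

Both sides must start from the same key and initialization vector, otherwise the two keystreams diverge and decryption fails.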
eSTREAM - Contest for a new stream cipher standard, 2004-2008
PROFILE 1
• Stream cipher suitable for software implementations optimized for high speed
• Key size - 128 bits
• Initialization vector – 64 bits or 128 bits
PROFILE 2
• Stream cipher suitable for hardware implementations with limited memory, number of gates, or power supply
• Key size - 80 bits
• Initialization vector – 32 bits or 64 bits
Schedule of the contest
November 2004 Request for proposals
29 April 2005 Deadline for submissions: 34 ciphers, with 23 candidates for PROFILE 1 and 26 candidates for PROFILE 2
26-27 May 2005 Stream Cipher Workshop, Denmark
March 2006 End of Phase I
July 2006 Beginning of the evaluation part of Phase II
September 2007 End of Phase II
January 2008 Final report
eSTREAM - Contest for a new stream cipher standard, 2004-2008
http://www.ecrypt.eu.org/stream/timetable.html
10 focus candidates
PROFILE 1 (Software)
Dragon - Ed Dawson, Kevin Chen, Matt Henricksen, William Millan, Leonie Simpson, HoonJae Lee, SangJae Moon
HC-256 - Hongjun Wu
LEX - Alex Biryukov
Phelix - Doug Whiting, Bruce Schneier, Stefan Lucks, Frédéric Muller
Py - Eli Biham and Jennifer Seberry
Salsa20 - Daniel Bernstein
SOSEMANUK - Come Berbain, Olivier Billet, Anne Canteaut, Nicolas Courtois, Henri Gilbert, Louis Goubin, Aline Gouget, Louis Granboulan, Cédric Lauradoux, Marine Minier, Thomas Pornin, Hervé Sibert
PROFILE 2 (Hardware)
Grain - Martin Hell, Thomas Johansson and Willi Meier
MICKEY-128 - Steve Babbage and Matthew Dodd
Phelix - Doug Whiting, Bruce Schneier, Stefan Lucks, Frédéric Muller
Trivium - Christophe De Cannière and Bart Preneel
Your task
For groups of the size ONE: implement ONE out of the following FIVE ciphers
For groups of the size TWO: implement TWO out of the following FIVE ciphers
Grain, MICKEY-128, Phelix, Salsa, Trivium
Optimization Criteria
I. Minimum area
II. Maximum ratio: Throughput divided by Total Circuit Area [CLB slices]
Required interface
[Figure: eSTREAM cipher block with the following ports]
clk
reset
enc_dec
key_IV (k bits), key_IV_ready, key_IV_write
data_in (d bits), data_in_ready, data_in_write
data_out (d bits), write, full
k = 1, 2, 4, 8, 16, 32, 64
d – set of allowed values specific to a given algorithm
Tasks of a TWO-person team
• Implement TWO ciphers
• Compare TWO ciphers against each other
eSTREAM Implementation Hints
Example of an eSTREAM cipher
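Linear Feedback Shift Register (LFSR)
An LFSR is denoted $\langle L, C(D) \rangle$, where $L$ is the length and $C(D)$ is the connection polynomial:
$C(D) = 1 + c_1 D + c_2 D^2 + \dots + c_L D^L$
Example of LFSR
$\langle 4,\ 1 + D + D^4 \rangle$ (length 4, connection polynomial $1 + D + D^4$)
Initial state: $[s_{L-1}, s_{L-2}, \dots, s_1, s_0]$
LFSR recursion:
$s_j = c_1 s_{j-1} \oplus c_2 s_{j-2} \oplus \dots \oplus c_{L-1} s_{j-(L-1)} \oplus c_L s_{j-L}$ for $j \ge L$
[Figure: shift register stages holding $s_{j-1}, s_{j-2}, \dots, s_{j-(L-1)}, s_{j-L}$, with feedback computing $s_j$]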
LFSR State Sequence
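The state sequence of the example LFSR with length 4 and connection polynomial 1 + D + D^4 can be reproduced with a short software model. The sketch below is an illustration only, not a hardware description; the function name `lfsr_sequence` and the chosen initial state are assumptions. It applies the recursion s_j = c1·s_{j-1} XOR … XOR cL·s_{j-L} directly.

```python
def lfsr_sequence(coeffs, init, n):
    """Return the first n bits s_0 .. s_{n-1} of an LFSR sequence.

    coeffs = [c_1, ..., c_L] are the connection polynomial coefficients,
    init   = [s_0, ..., s_{L-1}] is the initial state (oldest bit first).
    """
    seq = list(init)
    while len(seq) < n:
        j = len(seq)
        bit = 0
        # s_j = c_1 s_{j-1} XOR c_2 s_{j-2} XOR ... XOR c_L s_{j-L}
        for i, c in enumerate(coeffs, start=1):
            bit ^= c & seq[j - i]
        seq.append(bit)
    return seq[:n]


if __name__ == "__main__":
    # C(D) = 1 + D + D^4  ->  c_1 = 1, c_2 = 0, c_3 = 0, c_4 = 1
    coeffs = [1, 0, 0, 1]
    seq = lfsr_sequence(coeffs, init=[0, 1, 1, 0], n=19)
    # The state at time j is [s_{j+3}, s_{j+2}, s_{j+1}, s_j]
    for j in range(16):
        state = seq[j:j + 4][::-1]
        print(j, state)
```

For this polynomial and any non-zero initial state, the printed states repeat after 15 steps, which is the maximum period 2^4 - 1 for a 4-stage register.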
Non-linear Feedback Shift Register (NFSR)
Doubling the speed of Grain
Resources
eSTREAM PHASE 2 – the ECRYPT Stream Cipher Project
available at
http://www.ecrypt.eu.org/stream/
Source of test vectors
Reference C implementations provided by the authors of the algorithms.
Finite Impulse Response Filter
Topic proposed and co-advised by:
Dr. David Hwang Dr. Kathleen Wage
DSP Project: FIR Digital Filter Design
• Digital filters are widely used in digital communications and audio/video processing.
• In particular, finite impulse response (FIR) filters are used for their ease of implementation and stability.
• In this project, you will investigate different FIR filter structures and their VLSI implementations
– Step 1: Implement and compare direct form versus direct form transposed structures
– Step 2: Implement and compare fast FIR structures which reduce the number of required multiplications per sample
Example: Gigabit Ethernet Transceiver
• As seen above, digital filters (boxed in blue) play a crucial role in digital communication chips such as Ethernet transceivers, cable modems, DSL modems, satellite receivers, mobile phones, etc.
Step 1a: Direct Form FIR Filter
[Figure: direct form FIR filter; x(n) passes through a chain of z^-1 delays, the taps are multiplied by h0, h1, h2, …, hN-1, and the products are summed to form y(n)]
• An FIR filter implements a convolution in the time-domain
• Critical path of N-tap filter:
– N-1 adds + 1 multiply
• Arithmetic complexity of N-tap filter modeled as:
– N multiplications/sample + N-1 adds/sample
• Problem 1a: Design a parametrizable direct form FIR filter
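As a behavioral reference for the direct form, the convolution y(n) = h0·x(n) + h1·x(n-1) + … + hN-1·x(n-(N-1)) can be modeled in a few lines. This is a software sketch with ideal (unquantized) arithmetic, not the required VHDL; the function name `fir_direct` is an assumption, and the delay line is assumed to start at zero.

```python
def fir_direct(h, x):
    """Direct form FIR: a tapped delay line followed by one sum of products.

    h is the list of N coefficients, x the list of input samples.
    The delay line is assumed to be cleared (all zeros) before processing.
    """
    delay = [0] * len(h)  # delay[k] holds x(n-k)
    y = []
    for sample in x:
        # Shift the new sample into the delay line
        delay = [sample] + delay[:-1]
        # N multiplications and N-1 additions per output sample
        y.append(sum(hk * dk for hk, dk in zip(h, delay)))
    return y


if __name__ == "__main__":
    print(fir_direct([1, 2, 3], [1, 0, 0, 1]))  # prints [1, 2, 3, 1]
```

In hardware the sum is a chain of N-1 adders behind one multiplier, which is where the critical path of N-1 adds + 1 multiply comes from.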
Step 1b: Direct Form Transpose FIR Filter
• Use a signal flow graph reversal to reduce the critical path; the result is the transpose structure
• Critical path of N-tap transposed filter:
– 1 add + 1 multiply
• Arithmetic complexity of N-tap filter modeled as:
– N multiplications/sample + N-1 adds/sample
• Problem 1b: Design a parametrizable direct form transpose FIR filter
[Figure: direct form transpose FIR filter; x(n) is multiplied by hN-1, hN-2, hN-3, …, h0 in parallel, and the products are accumulated through a chain of z^-1 delays to form y(n)]
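The transposed structure can be modeled the same way to confirm that it computes the same output as the convolution while placing a register between every pair of adders. The sketch below uses ideal arithmetic and assumes all registers start at zero; the name `fir_transposed` is hypothetical.

```python
def fir_transposed(h, x):
    """Transposed FIR: every product is formed from the current input,
    and partial sums move toward the output through N-1 registers."""
    n_taps = len(h)
    regs = [0] * (n_taps - 1)  # regs[k] feeds the adder of tap k
    y = []
    for sample in x:
        products = [hk * sample for hk in h]
        # Output adder: only 1 multiply + 1 add on the critical path
        out = products[0] + (regs[0] if regs else 0)
        # Register update; ascending order keeps regs[k + 1] at its old value
        for k in range(n_taps - 2):
            regs[k] = products[k + 1] + regs[k + 1]
        if regs:
            regs[-1] = products[-1]
        y.append(out)
    return y


if __name__ == "__main__":
    print(fir_transposed([1, 2, 3], [1, 0, 0, 1]))  # prints [1, 2, 3, 1]
```

The input sample is broadcast to all multipliers, so the arithmetic count is unchanged; only the position of the registers differs from the direct form.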
Step 2: Power Reduction via Parallel Subexpression Sharing
[Figure: 2-parallel FIR filter; inputs x(2n) and x(2n+1) feed three N/2-tap subfilters H0(z), H0(z)+H1(z), and H1(z), whose outputs are combined, with one z^-1 delay, into y(2n) and y(2n+1)]
• Direct form and transpose form structures (running at the same rate) require N multiplications/sample and N-1 adds/sample
• Methods exist to reduce this complexity by parallel processing and subexpression sharing. See [1] and [2] for details and derivation.
– In the 2-parallel structure above, two inputs arrive at half the original clock rate and are processed in parallel by three ceil(N/2)-tap filters [ceil() is the ceiling function]
– Arithmetic complexity of the 2-parallel filter is approximately: 3 x N/2 multiplications / two samples + 3 x (N/2-1) adds / two samples + 4 adds / two samples = 3/4 N multiplications/sample + (3N/4 + 1/2) adds/sample
– If power is dominated by multipliers, 25% power savings over traditional structures!
• Problem 2a: Design a 2-parallel parametrizable FIR filter
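A behavioral sketch can help check a 2-parallel design against the plain convolution. The slide defers the derivation to [1] and [2]; the code below assumes the commonly used form Y0 = H0·X0 + z^-1·H1·X1 and Y1 = (H0+H1)(X0+X1) - H0·X0 - H1·X1, where the delay is one sample at the half rate. The function names are hypothetical, arithmetic is ideal, and N and the number of input samples are assumed to be even.

```python
def fir(h, x):
    """Reference convolution, truncated to len(x) output samples."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if k <= n)
            for n in range(len(x))]


def fir_2parallel(h, x):
    """2-parallel fast FIR built from three N/2-tap subfilters."""
    h0, h1 = h[0::2], h[1::2]            # polyphase decomposition by 2
    h01 = [a + b for a, b in zip(h0, h1)]
    x0, x1 = x[0::2], x[1::2]            # x(2n) and x(2n+1)
    x01 = [a + b for a, b in zip(x0, x1)]  # 1 pre-processing add

    u = fir(h0, x0)      # H0 * X0
    v = fir(h1, x1)      # H1 * X1
    w = fir(h01, x01)    # (H0 + H1) * (X0 + X1)

    y = []
    v_delayed = 0        # z^-1 at the half rate, initially zero
    for m in range(len(x0)):
        y.append(u[m] + v_delayed)       # y(2m): 1 add
        y.append(w[m] - u[m] - v[m])     # y(2m+1): 2 adds (subtractions)
        v_delayed = v[m]
    return y


if __name__ == "__main__":
    h = [1, 2, 3, 4, 5, 6, 7, 8]
    x = [3, -1, 4, 1, -5, 9, 2, -6]
    print(fir_2parallel(h, x) == fir(h, x))
```

The sketch counts 1 pre-processing and 3 post-processing additions per pair of samples, matching the "4 adds / two samples" term in the complexity estimate.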
Obtaining Coefficients of 2-Parallel Subfilters
• Example for N = 8
– H(z) = {h0, h1, h2, h3, h4, h5, h6, h7}
• Subfilter coefficients obtained by performing a polyphase decomposition by 2. Each subfilter has N/2 = 4 coefficients:
– H0(z) = {h0, h2, h4, h6}
– H1(z) = {h1, h3, h5, h7}
– H0(z) + H1(z) = {h0+h1, h2+h3, h4+h5, h6+h7}
3-parallel filter
[Figure: 3-parallel FIR filter; inputs x(3n), x(3n+1), and x(3n+2) feed six N/3-tap subfilters H0(z), H1(z), H2(z), H0(z)+H1(z), H1(z)+H2(z), and H0(z)+H1(z)+H2(z), whose outputs are combined, with two z^-1 delays, into y(3n), y(3n+1), and y(3n+2)]
• In the 3-parallel filter, three inputs arriving at a third of the original rate are processed by six parallel ceil(N/3)-tap filters
• Arithmetic complexity of the 3-parallel filter is approximately:
– 2/3 N multiplications/sample + (2N/3 + 4/3) adds/sample
– 33% reduction in multiplications/sample
• Problem 2b: Design a 3-parallel parametrizable FIR filter
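The six-subfilter arrangement can be checked in software before writing RTL. The sketch below assumes one common factorization of the 3-parallel fast FIR (the slide itself refers to [1] and [2] for the derivation): with T1 = H0·X0 - z^-1·H2·X2, T2 = (H0+H1)(X0+X1) - H1·X1 and T3 = (H1+H2)(X1+X2) - H1·X1, the outputs are Y0 = T1 + z^-1·T3, Y1 = T2 - T1 and Y2 = (H0+H1+H2)(X0+X1+X2) - T2 - T3. The delay is one sample at the one-third rate. All names are hypothetical, arithmetic is ideal, and N and the input length are assumed to be multiples of 3.

```python
def fir(h, x):
    """Reference convolution, truncated to len(x) output samples."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if k <= n)
            for n in range(len(x))]


def add(a, b):
    return [p + q for p, q in zip(a, b)]


def sub(a, b):
    return [p - q for p, q in zip(a, b)]


def delay(a):
    """One-sample delay at the subsampled rate, starting from zero."""
    return [0] + a[:-1]


def fir_3parallel(h, x):
    h0, h1, h2 = h[0::3], h[1::3], h[2::3]   # polyphase decomposition by 3
    x0, x1, x2 = x[0::3], x[1::3], x[2::3]   # x(3n), x(3n+1), x(3n+2)

    # Six N/3-tap subfilters (3 pre-processing adds on the inputs)
    a = fir(h0, x0)
    b = fir(h1, x1)
    c = fir(h2, x2)
    d = fir(add(h0, h1), add(x0, x1))
    e = fir(add(h1, h2), add(x1, x2))
    f = fir(add(add(h0, h1), h2), add(add(x0, x1), x2))

    # Post-processing (7 adds/subtractions)
    t1 = sub(a, delay(c))
    t2 = sub(d, b)
    t3 = sub(e, b)
    y0 = add(t1, delay(t3))
    y1 = sub(t2, t1)
    y2 = sub(sub(f, t2), t3)

    # Interleave y(3n), y(3n+1), y(3n+2)
    return [v for triple in zip(y0, y1, y2) for v in triple]


if __name__ == "__main__":
    h = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    x = [2, -3, 5, 7, -1, 0, 4, 4, -6]
    print(fir_3parallel(h, x) == fir(h, x))
```

This version uses 3 + 7 = 10 additions around the subfilters per three samples, which together with 6 x (N/3 - 1) subfilter additions gives the (2N/3 + 4/3) adds/sample estimate.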
Obtaining Coefficients of 3-Parallel Subfilters
• Example for N = 9
– H(z) = {h0, h1, h2, h3, h4, h5, h6, h7, h8}
• Subfilter coefficients obtained by performing a polyphase decomposition by 3. Each subfilter has N/3 = 3 coefficients:
– H0(z) = {h0, h3, h6}
– H1(z) = {h1, h4, h7}
– H2(z) = {h2, h5, h8}
– H0(z) + H1(z) = {h0+h1, h3+h4, h6+h7}
– H1(z) + H2(z) = {h1+h2, h4+h5, h7+h8}
– H0(z) + H1(z) + H2(z) = {h0+h1+h2, h3+h4+h5, h6+h7+h8}
Further parallelism
• These parallel structures introduce issues such as increased area, adder overhead (pre- and post-processing), etc. which eventually become prohibitive as the subsampling rate increases
Assumptions
All coefficients are loaded to the circuit before the start of processing and do not change during the runtime.
Registers storing coefficients are connected in chain, so coefficients must be loaded serially, in the proper order, starting from the ones with the smallest indices.
Parameters of the design
N: number of taps (N=8, 12, 16, 24, 32)
M: fractional wordlength of input (M=8..10)
K: fractional wordlength of output (K=8..10)
L: fractional wordlength of coefficients (L=7-11)
Required interface - basic architecture
[Figure: FIR Filter block with the following ports]
clk
reset_datapath
reset_coeff
filt_mode (0=load coefficients, 1=run filter)
load_begin (0=idle, 1=start to load coefficients)
load_coeff_done
coeff (1.L)
d_in (1.M)
d_out (1.K)
Required interface – 2-parallel structure
[Figure: FIR Filter block with the following ports]
clk
reset_datapath
reset_coeff
filt_mode (0=load coefficients, 1=run filter)
load_begin (0=idle, 1=start to load coefficients)
load_coeff_done
coeff (1.L)
d_in_1 (1.M)
d_in_2 (1.M)
d_out_1 (1.K)
d_out_2 (1.K)
One-Person Team Requirements
• Matlab code will be given for five different configurations (A, B, C, D, E), each with different values of N, M, L, and K.
– CASE A: N = 8, M = 8, K = 8, L = 7
– CASE B: N = 12, M = 9, K = 9, L = 8
– CASE C: N = 16, M = 9, K = 10, L = 9
– CASE D: N = 24, M = 10, K = 11, L = 10
– CASE E: N = 32, M = 11, K = 12, L = 11
• Step 1: Direct form and transpose form structures:
– Generate parametrizable VHDL code; round output of each multiplier to K fractional bits
– Generate test vectors using Matlab and verify the test vectors in RTL for configurations A-E
– Implement configurations B and D on FPGA
  • Optimize for minimum area
  • Optimize for maximum ratio of: throughput / area (CLB slices)
• Step 2: 2-parallel and 3-parallel fast FIR structures
– Generate parametrizable VHDL code; round output of each multiplier to K fractional bits
– Generate test vectors using Matlab and verify the test vectors in RTL for configurations B and D
– Implement configurations B and D on FPGA
  • Optimize for minimum area
  • Optimize for maximum ratio of: throughput / area (CLB slices)
Two-Person Team Additional Requirements
• Step 3: 4-parallel and 6-parallel fast FIR structures. See ref [2] for block diagrams.
– Generate parametrizable VHDL code; round output of each multiplier to K fractional bits
– Generate test vectors using Matlab and verify the test vectors in RTL for configurations B and D
– Implement configurations B and D on FPGA
  • Optimize for minimum area
  • Optimize for maximum ratio of: throughput / area (CLB slices)
• Step 4: Quantization studies
– For the 6-parallel filter and configurations B and D, implement truncation instead of rounding after the multipliers.
  • Optimize for minimum area
  • Optimize for maximum ratio of: throughput / area (CLB slices)
– For the 4-parallel filter and configurations B and D, round to K+4 bits after the multipliers. Round again to K bits right before the filter outputs to produce a 1.K output.
  • Optimize for minimum area
  • Optimize for maximum ratio of: throughput / area (CLB slices)
Required reading
[1] Z. Mou and P. Duhamel, “Short-length FIR filters and their use in fast nonrecursive filtering,” IEEE Transactions on Signal Processing, vol. 39, no. 6, pp. 1322-1332, June 1991.
[2] K.K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley, pp. 256-275, 1999.
Source of test vectors
Matlab implementation – provided by Dr. Hwang
Important Notes on Two’s Complement Arithmetic
Project Notation
• For this project, we are using two’s complement fractional notation
• An m.M number indicates a two’s complement m+M bit word with m integer bits and M fractional bits
• Example: 1.3 number
– 0.111 = +0.875
– 1.000 = -1
– 1.111 = -0.125
• Example: 2.2 number
– 00.11 = +0.75
– 10.00 = -2
– 01.01 = +1.25
• The dynamic range of an m.M number is $[-2^{m-1}, 2^{m-1})$
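The notation can be checked with a small helper that converts a bit pattern to its numeric value. This is an illustrative sketch; the function names `fixed_value` and `dynamic_range` are hypothetical, and the bit string may contain a "." purely for readability.

```python
def fixed_value(bits, int_bits):
    """Value of a two's complement word with int_bits integer bits.

    The remaining bits are fractional; a '.' in the string is ignored.
    """
    raw = bits.replace(".", "")
    frac_bits = len(raw) - int_bits
    value = int(raw, 2)
    if raw[0] == "1":              # the sign bit carries negative weight
        value -= 1 << len(raw)
    return value / (1 << frac_bits)


def dynamic_range(int_bits):
    """Half-open range [low, high) of a number with int_bits integer bits."""
    return (-2.0 ** (int_bits - 1), 2.0 ** (int_bits - 1))


if __name__ == "__main__":
    for word in ("0.111", "1.000", "1.111"):
        print(word, fixed_value(word, int_bits=1))
    for word in ("00.11", "10.00", "01.01"):
        print(word, fixed_value(word, int_bits=2))
    print(dynamic_range(1), dynamic_range(2))
```

Note that the number of fractional bits changes the resolution but not the dynamic range, which depends only on the integer part.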
Two’s Complement Multiplication
• The wordlength required for the product of 1.M x 1.L numbers =
– 2.(M+L) if we assume -1 x -1 = +1 may occur
– 1.(M+L) if we assume -1 x -1 = +1 will never occur
• In general a product of m.M x l.L numbers =
– (m+l).(M+L) if we assume (most neg value of a) x (most neg value of b) may occur
– (m+l-1).(M+L) if we assume (most neg value of a) x (most neg value of b) will never occur
• In this project, we assume that (most neg value of a) x (most neg value of b) will never occur for any multiplier in any filter structure. This is guaranteed by scaling the inputs and coefficients properly in Matlab.
– Examples: 1.5 x 2.5 = 2.10, 1.4 x 1.6 = 1.10, 3.4 x 2.3 = 4.7
[Figure: multiplier with inputs a (1.M) and b (1.L) and output 1.(M+L)]
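The wordlength rule can be written as a small function and checked against the examples above. The sketch also shows why the extra integer bit is needed only for the single corner case: the product of the two most negative values is the one result that does not fit. The names `product_format` and `fits` are hypothetical; values are held as integers scaled by 2^(fractional bits).

```python
def product_format(m, M, l, L, most_negative_product_possible=False):
    """Return (integer bits, fractional bits) of an m.M x l.L product."""
    int_bits = m + l if most_negative_product_possible else m + l - 1
    return (int_bits, M + L)


def fits(scaled_value, int_bits, frac_bits):
    """True if the scaled integer lies in the range of an int_bits.frac_bits word."""
    width = int_bits + frac_bits
    return -(1 << (width - 1)) <= scaled_value < (1 << (width - 1))


if __name__ == "__main__":
    print(product_format(1, 5, 2, 5))   # prints (2, 10)
    print(product_format(1, 4, 1, 6))   # prints (1, 10)
    print(product_format(3, 4, 2, 3))   # prints (4, 7)

    # Corner case for 1.4 x 1.6: (-1) x (-1) = +1, scaled by 2^10
    M, L = 4, 6
    corner = (-(1 << M)) * (-(1 << L))
    print(fits(corner, 1, M + L))       # prints False
    print(fits(corner, 2, M + L))       # prints True
```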
Two’s Complement Truncation versus Rounding
• In this project, we ask you to round the output of each multiplier to K fractional bits.
• To round a k.K’ number to a k.K number (K < K’):
– Truncate the k.K’ number to become a k.K number
– Add the former fractional K+1 bit to fractional position K
• For information purposes, to truncate a k.K’ number to a k.K number (K < K’):
– Truncate the k.K’ number to become a k.K number
• Rounding and truncation produce equal noise variance, whereas rounding is (approximately) unbiased and truncation is biased
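Truncation versus Rounding: Example: 2.5 number to a 2.3 number
ROUNDING (truncate, then add the former 4th fractional bit at the last position)
00.01110   00.011 + 1 = 00.100
11.01000   11.010 + 0 = 11.010
10.00110   10.001 + 1 = 10.010
TRUNCATION
00.01110   00.011
11.01000   11.010
10.00110   10.001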
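The two quantization rules map directly onto integer shifts, which is also how they are typically realized in hardware. The sketch below reproduces the 2.5 to 2.3 examples; the helper names are hypothetical, and words are represented as Python signed integers scaled by 2^(fractional bits).

```python
def from_bits(bits):
    """Signed integer held by a two's complement bit string ('.' ignored)."""
    raw = bits.replace(".", "")
    value = int(raw, 2)
    if raw[0] == "1":
        value -= 1 << len(raw)
    return value


def to_bits(value, int_bits, frac_bits):
    """Format a signed integer as an int_bits.frac_bits bit string."""
    width = int_bits + frac_bits
    raw = format(value & ((1 << width) - 1), "0{}b".format(width))
    return raw[:int_bits] + "." + raw[int_bits:]


def truncate(value, drop):
    """Discard the 'drop' least significant bits (arithmetic shift right)."""
    return value >> drop


def round_to(value, drop):
    """Truncate, then add the most significant discarded bit."""
    return (value >> drop) + ((value >> (drop - 1)) & 1)


if __name__ == "__main__":
    for word in ("00.01110", "11.01000", "10.00110"):
        v = from_bits(word)
        print(word,
              "rounded:", to_bits(round_to(v, 2), 2, 3),
              "truncated:", to_bits(truncate(v, 2), 2, 3))
```

Rounding can carry out of the word when the truncated value is already the largest positive number; `to_bits` wraps in that case, so a design must either guarantee this never happens or handle it explicitly.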
Two’s Complement Addition
• FIR filters perform chains of additions
• A k.K number plus a k.K number requires a (k+1).K number to represent the sum
– Ex. 0.110 (0.75) + 0.110 (0.75) = 01.100 (1.5)
– Ex. 1.000 (-1) + 1.000 (-1) = 10.000 (-2)
• In general, an adder chain summing J numbers, each of wordlength k.K, requires a wordlength of (k + ceil(log2(J))).K after the final adder
– This grows for a large number of coefficients N
[Figure: direct form adder chain; each 1.M input sample is multiplied by a coefficient, each product is rounded (ROUND) to 1.K, and the 1.K products are summed to form y(n)]
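The growth rule can be tabulated for the filter lengths used in this project. The sketch below is illustrative; the name `sum_format` is hypothetical, and `(J - 1).bit_length()` is used as an exact integer form of ceil(log2(J)) for J >= 1.

```python
def sum_format(k, K, J):
    """Wordlength (integer bits, fractional bits) after summing J k.K numbers."""
    growth = (J - 1).bit_length()   # equals ceil(log2(J)) for J >= 1
    return (k + growth, K)


if __name__ == "__main__":
    print(sum_format(1, 3, 2))      # prints (2, 3)
    # Full-precision output width of an N-tap adder chain with 1.K products
    for n_taps in (8, 12, 16, 24, 32):
        print(n_taps, sum_format(1, 9, n_taps))
```

For N = 32 the full-precision sum needs 5 extra integer bits on top of the 1.K products, which is the overhead that motivates keeping intermediate nodes narrower when the final output is known to be bounded.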
Two’s Complement Adder Chain Trick using Modulo Arithmetic
• Trick: if we know output of adder is bounded within a k’.K value (where k’ is some known value), then all intermediate addition nodes only require k’.K bit wordlengths
– Provides hardware savings for large number of coefficients N!
• This is only true if we know the output of the adder chain is bounded
– Be careful, because x(2n) + x(2n+1) is not guaranteed to be bounded in 1.M; you need the full 2.M
– h(0) + h(1) is not guaranteed to be bounded in 1.L; you need the full 2.L
– In this project, this trick helps after multiplier outputs, not on multiplier inputs
• In our project, the final output y(n) is bounded within a 1.K bit wordlength. This has been controlled by scaling the inputs and coefficients in Matlab.
• To learn about more helpful hardware “tricks” take ECE 645 next semester!
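[Figure: two adder chains with 1.K inputs producing y(n); in one, every intermediate node is kept at 1.K, and in the other the intermediate wordlengths grow to 2.K and 3.K]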
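The reason the trick works is that two's complement addition is arithmetic modulo 2^(wordlength): intermediate overflows cancel as long as the final sum fits. The sketch below demonstrates this with 1.3 numbers stored as integers scaled by 8; the helper names are hypothetical and the example values are chosen only for illustration.

```python
def wrap(value, width):
    """Reduce an integer to a width-bit two's complement value."""
    half = 1 << (width - 1)
    return ((value + half) % (1 << width)) - half


def wrapped_sum(terms, width):
    """Adder chain in which every intermediate node is only width bits wide."""
    acc = 0
    for t in terms:
        acc = wrap(acc + t, width)   # overflow simply wraps around
    return acc


if __name__ == "__main__":
    # 1.3 numbers scaled by 8: 0.75, 0.625, -0.875
    terms = [6, 5, -7]
    # 6 + 5 = 11 overflows a 4-bit word and wraps to -5,
    # but the final result is still the true sum 4 (= 0.5)
    print(wrapped_sum(terms, width=4), sum(terms))

    # If the true sum does not fit, the wrapped result is wrong
    print(wrapped_sum([6, 5], width=4), sum([6, 5]))
```

The second case shows why the bound on the final output must be known in advance: the narrow chain gives no indication that the result has wrapped.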