Comparison of the Hardware Performance of the AES Candidates Using Reconfigurable Hardware A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science at George Mason University By Pawel R. Chodowiec Bachelor of Science Warsaw University of Technology, 1998 Director: Kris M. Gaj, Assistant Professor Department of Electrical and Computer Engineering Spring Semester 2002 George Mason University Fairfax, VA
Table of Contents
Abstract
1. Preface
1.1 Data Encryption Standard
1.2 DES security
2. Introduction
2.1 Advanced Encryption Standard
2.1.1 Requirements and evaluation criteria
2.1.2 Evaluation process
2.2 Need for comparison of hardware implementations
2.3 Previous work
3. Characteristics of hardware implementations
3.1 Hardware vs. software implementations
3.2 Parameters of hardware implementations
3.2.1 Throughput
3.2.2 Latency
3.2.3 Area
3.3 Design tradeoffs
3.3.1 Increasing the throughput
3.3.2 Decreasing the area
4. Hardware architectures for symmetric-key block ciphers
4.1 Main characteristics of block ciphers
4.1.1 Structure of a symmetric-key block cipher
4.1.2 Key schedule
4.1.3 Modes of operation
4.2 Basic iterative architecture
4.3 Loop unrolling
4.4 Outer round pipelining
4.5 Inner round pipelining
4.6 Mixed inner- and outer-round pipelining
5. Methodology of comparison of AES candidates
5.1 Limits of this research
5.2 Choice of architectures
5.2.1 Comparison in feedback modes
5.2.2 Comparison in non-feedback modes
5.3 Tools, design process and synthesis parameters
6. Implementation of AES candidates
6.1 MARS
6.1.1 Structure and components of MARS
6.1.2 Implementation of multiplication modulo 2^32
6.1.3 Results of the implementation of MARS
6.2 RC6
6.2.1 Structure and components of RC6
6.2.2 Implementation of squaring modulo 2^32
6.2.3 Results of the implementation of RC6
6.3 Rijndael
6.3.1 Structure and components of Rijndael
6.3.2 Results of the implementation of Rijndael
6.4 Serpent
6.4.1 Structure and components of Serpent
6.4.2 Results of the implementation of Serpent
6.5 Twofish
6.5.1 Structure and components of Twofish
6.5.2 Results of the implementation of Twofish
7. Analysis of the results
7.1 Comparison of ciphers in feedback modes
7.2 Comparison of ciphers in non-feedback modes
8. Summary
List of References
List of Tables
2.1-1 Fifteen candidate algorithms
2.1-2 Security margins of final AES candidate algorithms
3.1-I Characteristic features of implementations of cryptographic transformations in ASICs, FPGAs, and software
3.3-I Features of methods exploring parallel computations
List of Figures
3.1-1 Structure of the Virtex FPGA.
3.1-2 Example of an attack on the hardware implementation.
3.2-1 System consisting of multiple modules with throughput parameters.
3.2-2 System consisting of multiple modules with latency parameters.
3.2-3 Circuit with FIFO buffers.
3.2-4 Variety of functions possible to implement using one lookup table (LUT).
3.2-5 Example of LUT utilization.
3.3-1 Parallel processing units – string of data split among units.
3.3-2 Principles of pipelined implementation.
3.3-3 Pipeline with delay of registers taken into account.
3.3-4 Unbalanced pipeline.
3.3-5 Pipelining of circuit consisting of unequal operations.
3.3-6 Example of unnecessarily placed register.
3.3-7 Throughput in the pipelined implementations.
3.3-8 Pipelining of Feistel-network cipher.
3.3-9 Example of an array multiplier as a circuit requiring additional area for registers when pipelined.
3.3-10 Resource sharing.
4.1-1 Flow diagram of a typical symmetric-key block cipher.
4.1-2 Example of feedback and non-feedback modes of operation.
4.1-3 Counter mode.
4.2-1 Basic iterative architecture.
4.2-2 Critical path in the basic iterative architecture.
4.3-1 Loop unrolling.
4.3-2 Optimization of logic across rounds.
4.3-3 Simultaneous evaluation of functions in unrolled rounds.
4.3-4 Throughput vs. area ratio for unrolled architectures.
4.4-1 Outer round pipelining.
4.4-2 Optimization of logic across rounds.
4.4-3 Throughput vs. area ratio in outer round pipelining.
4.5-1 Inner round pipelining.
4.5-2 Throughput vs. area ratio for inner round pipelining.
4.6-1 Mixed inner- and outer-round pipelining.
4.6-2 Throughput vs. area ratio for mixed pipelining.
4.6-3 Latency vs. area ratio for mixed pipelining.
5.1-1 Block diagram common for all implementations.
5.2-1 Throughput/area ratio for mixed architecture.
5.3-1 Design flow for each implementation.
6.1-1 High-level structure of MARS.
6.1-2 Mixing transformation.
6.1-3 Mixing transformation core.
6.1-4 Keyed transformation.
6.1-5 Keyed transformation core.
6.1-6 E-function. Red line indicates critical path.
6.1-7 Variable rotation.
6.1-8 Virtex Slice with carry logic.
6.1-9 Example of multiplication scheme. Two AND gates feed full adder.
6.1-10 Multiplication – implementation of the circuit from 6.1-9 in a Virtex Slice.
6.1-11 Array multiplier modulo 2^8.
6.1-12 Structure of an array multiplier with reversed order of additions.
6.1-13 Change from array to tree.
6.1-14 Final multiplication schematic.
6.2-1 Implementation of one round of RC6.
6.2-2 Squarer derived from array multiplier.
6.2-3 Squarer modulo 2^8.
6.2-4 Optimized squarer modulo 2^8.
6.3-1 One round of Rijndael.
6.3-2 Construction of a) ByteSub, b) InvByteSub transformations.
6.3-3 Structure of the implementation of a single round of Rijndael.
6.4-1 Single round of Serpent.
6.4-2 Implementation of Serpent I8 in basic architecture.
6.4-3 Serpent I1.
6.5-1 High-level structure of the Twofish cipher.
6.5-2 S-boxes in Twofish.
6.5-3 Permutation q.
6.5-4 PHT transformation.
6.5-5 Implementation of a single round of Twofish.
7.1-1 Throughput for Virtex XCV-1000, my results.
7.1-2 Area for Virtex XCV-1000, my results.
7.1-3 Throughput for Virtex XCV-1000, comparison with results of other groups.
7.1-4 Area for Virtex XCV-1000, comparison with results of other groups.
7.1-5 Throughput vs. area for Virtex-1000, our results. The result for Serpent I1 based on [12].
7.1-6 Throughput vs. area for 0.5 µm CMOS standard-cell ASICs, NSA result.
7.2-1 Throughput for mixed inner- and outer-round pipelining in Virtex1000, my results.
7.2-2 Area for mixed inner- and outer-round pipelining on Virtex1000, my results.
7.2-3 Increase in the encryption/decryption latency as a result of moving from the basic architecture to mixed inner- and outer-round pipelining.
7.2-4 Throughput for 0.5 µm CMOS standard-cell ASICs, NSA results.
8-1 Results of survey filled by participants of the AES3 conference.
Abstract
COMPARISON OF THE HARDWARE PERFORMANCE OF THE AES
CANDIDATES USING RECONFIGURABLE HARDWARE
Pawel Chodowiec, Computer Engineering M.S.
George Mason University, 2002
Thesis Director: Dr. Kris M. Gaj
The results of fast implementations of all five AES final candidates using Xilinx Virtex
Field Programmable Gate Arrays are presented and analyzed. The performance of several
alternative hardware architectures is discussed and compared. The architecture that is
optimal in terms of the throughput to area ratio is selected for each of the two major
types of block cipher modes. For feedback cipher modes, all AES candidates have been
implemented using the basic iterative architecture, and achieved speeds ranging from 61
Mbit/s for Mars to 431 Mbit/s for Serpent. For non-feedback cipher modes, four AES
candidates have been implemented using a high-throughput architecture with pipelining
inside and outside of cipher rounds, and achieved speeds ranging from 12.2 Gbit/s for
Rijndael to 16.8 Gbit/s for Serpent. A new methodology for a fair comparison of the
hardware performance of secret-key block ciphers has been developed and contrasted
with methodology used by the NSA team.
1. Preface
1.1 Data Encryption Standard
DES is probably one of the best-studied and most controversial ciphers. Its history
began in 1973, when the National Bureau of Standards (NBS) issued a public request for
proposals for a standard symmetric-key cryptographic algorithm. The request specified a
series of design criteria. Some of the most important requirements were:
• The algorithm had to provide a high level of security,
• The algorithm had to be completely specified and easy to understand,
• The security of the algorithm had to reside in the key, and could not depend on the
secrecy of the algorithm,
• The algorithm had to be available to all users on a royalty-free basis,
• The algorithm had to be adaptable for use in diverse applications,
• The algorithm had to be economically implementable in electronic devices.
In 1974 IBM submitted a promising algorithm in response to this request. NBS
asked the National Security Agency (NSA) for help in evaluating the algorithm. NSA
introduced a few changes to the algorithm. The key length was shortened from 128
to 56 bits. The content of all S-boxes was also changed. NSA, however, classified all
information justifying these changes, and DES has been criticized ever since. Many
researchers suspected that NSA had installed a trapdoor in the S-boxes permitting NSA to
cryptanalyze the algorithm. The reduction of the key length was also controversial.
Despite all the criticism, DES was adopted as a US encryption standard in 1977,
and became de facto a world standard. The algorithm is defined in the American standard
FIPS 46 "Data Encryption Standard", and is described as a 16-round Feistel-network
cipher operating on 64-bit blocks of data.
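The Feistel structure mentioned above can be illustrated with a minimal sketch. The round function `toy_round_function`, the subkey values, and all constants below are illustrative assumptions; they are not the DES f-function or the DES key schedule. The sketch only shows how a 64-bit block is split into two 32-bit halves, processed for 16 rounds, and how decryption reuses the same network with the subkeys applied in reverse order.

```python
MASK32 = 0xFFFFFFFF  # each half of a 64-bit block is 32 bits wide


def toy_round_function(half: int, subkey: int) -> int:
    """Placeholder round function (an assumption, NOT the DES f-function)."""
    x = (half ^ subkey) & MASK32
    x = (x * 0x9E3779B1) & MASK32           # arbitrary odd multiplier for mixing
    return ((x << 7) | (x >> 25)) & MASK32  # rotate left by 7 bits


def feistel_encrypt(block: int, subkeys: list) -> int:
    """Run a generic Feistel network over a 64-bit block, one round per subkey."""
    left, right = (block >> 32) & MASK32, block & MASK32
    for subkey in subkeys:
        # The right half passes through unchanged; the left half is masked
        # with the round function of the right half.
        left, right = right, left ^ toy_round_function(right, subkey)
    # Swap the halves at the end so that decryption is the same network
    # run with the subkeys in reverse order.
    return (right << 32) | left


def feistel_decrypt(block: int, subkeys: list) -> int:
    """Decrypt by reusing the encryption network with reversed subkeys."""
    return feistel_encrypt(block, subkeys[::-1])


if __name__ == "__main__":
    subkeys = list(range(1, 17))  # 16 hypothetical round subkeys
    plaintext = 0x0123456789ABCDEF
    ciphertext = feistel_encrypt(plaintext, subkeys)
    assert feistel_decrypt(ciphertext, subkeys) == plaintext
```

Note that the round function does not need to be invertible: the XOR with the untouched half is what makes each round reversible, which is one reason the same circuit can serve for both encryption and decryption.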
The terms of the DES standard required its review every five years. In 1983 the
standard was automatically recertified for the next five years. In 1987 NSA proposed
the Commercial COMSEC Endorsement Program, which would lead to the development
of a series of algorithms replacing DES. Those algorithms would not be made public, and
would be available only in tamper-proof VLSI chips. NSA's proposal was not well
received, and because of the lack of other proposals the DES standard remained in effect
for the next five years. In 1993 the 15-year-old standard still remained unbroken. Again, the lack
of any alternative led to its recertification for another five years. In 1997, the National
Institute of Standards and Technology (NIST; formerly NBS), aware of the weakness of DES,
lying mainly in its short key, announced a contest for the development of the Advanced
Encryption Standard, which is going to replace the 20-year-old DES.
1.2 DES security
The unclear design criteria classified by NSA sparked the biggest worldwide
effort to break DES. What were the criteria for choosing the S-boxes? Why does DES consist
of exactly 16 rounds? Why does the key have only 56 bits? Those and other questions
exposed DES to cryptanalysis like no other cipher. Despite all attempts to find any
crack, the secrets of DES remained hidden for nearly 15 years. Finally, in 1990 Eli Biham
and Adi Shamir discovered differential cryptanalysis, a new and powerful method of
cryptanalysis [5]. DES appeared to be surprisingly resistant to the new attack [6, 7]. The
attack requires 2^47 chosen plaintexts or 2^55 known plaintexts, and has an analytical
complexity of 2^37 operations. The enormous amount of data and time needed to mount the attack
makes it less efficient than the brute force search for the key. Biham and Shamir came to
interesting conclusions:
• The S-boxes happened to be optimized against differential cryptanalysis,
• Any number of rounds less than 16 makes differential cryptanalysis more efficient
than the brute force attack for a known-plaintext attack.
Why is DES so resistant to an attack discovered many years after its
development? The answer to this question is even more surprising. The designers of DES
already knew about differential cryptanalysis at design time. After consultation with NSA,
they decided that disclosing the design considerations might reveal differential
cryptanalysis. Although DES was already resistant to the new attack, many other ciphers
that were already in use turned out to be vulnerable. After the details of
differential cryptanalysis were published, IBM finally published the design criteria for the
S-boxes and P-box, showing that no trapdoor was intended to be installed. Soon researchers in the open
cryptographic community began to appreciate the design principles behind DES.
Is DES still secure today? No attack more practical than the brute force search has been
discovered, but the main objection, that the key is too short, remains irrefutable. Ever since
DES was first proposed in the 1970s, it has been criticized for its short key. US
government officials claimed that governments could not decrypt information
protected by DES, or that it would take multimillion-dollar networks of computers and
months to decrypt one message.
In 1997 RSA Laboratories issued a series of challenges in order to demonstrate
that DES offers only marginal protection. The first DES Challenge was launched in
January 1997. The secret key was recovered in 96 days by a team led by Rocke Verser of
Loveland, Colorado. In February 1998, Distributed.Net won DES Challenge II-1 after a
41-day effort. Distributed.Net combined tens of thousands of computers connected
through the Internet for this task. In July 1998, the Electronic Frontier Foundation (EFF) won
DES Challenge II-2 by recovering an encrypted message in 56 hours, shattering the
previous record. The answer to the challenge was “It’s time for those 128-, 192-, and
256-bit keys”. The main significance of the new record lies in the fact that a single
machine, specifically designed for cracking DES, achieved what the US government claimed
was impossible.
The design of the EFF DES Cracker consists of an ordinary personal computer
connected to a large array of custom chips. One ASIC chip contains 24 search units,
each capable of checking 2.5 million keys per second. Over 1800 chips were used in the
design, giving a search speed of 90 billion keys per second. The average time to recover
a key is only 4.5 days. It took EFF less than one year to build Deep Crack, and it cost
only $220,000. EFF and O’Reilly and Associates have published a book about the EFF DES
Cracker [EFF+98]. The book contains the complete design details for the Deep Crack
chips, boards and software. In doing so, EFF proved that DES is undoubtedly insecure.
Moreover, it suggests that most of the world’s governments may already have built similar or
even more powerful machines.
DES Cracker "DeepCrack" custom microchip.
DES Cracker circuitboard fitted with DeepCrack chips.
The machine tests over 90 billion keys per second, taking an average of less than 5 days to discover a DES key.
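The average search time quoted for the EFF DES Cracker can be checked with a short calculation. The sketch below assumes only that, on average, half of the 2^56 possible keys must be tried before the correct one is found, and it uses the search rate of 90 billion keys per second quoted in the text; the function names are illustrative. Under these assumptions the average comes out at roughly 4.6 days, consistent with the figures above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_search_days(key_bits: int, keys_per_second: float) -> float:
    """Expected exhaustive-search time in days.

    On average, half of the 2**key_bits keys are tested before the
    correct one is found.
    """
    expected_trials = 2 ** (key_bits - 1)
    return expected_trials / keys_per_second / SECONDS_PER_DAY


def worst_case_search_days(key_bits: int, keys_per_second: float) -> float:
    """Time in days to test every one of the 2**key_bits keys."""
    return 2 ** key_bits / keys_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    rate = 90e9  # keys per second, as quoted for the EFF DES Cracker
    print(f"56-bit key, average:    {average_search_days(56, rate):.1f} days")
    print(f"56-bit key, worst case: {worst_case_search_days(56, rate):.1f} days")
    # The same machine against a 128-bit key, expressed in years.
    years = average_search_days(128, rate) / 365.25
    print(f"128-bit key, average:   {years:.2e} years")
```

Each additional key bit doubles the search time, which is why the 128-, 192-, and 256-bit keys required of the AES candidates put exhaustive search far out of reach of a machine of this kind.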
The final nail was put into the Data Encryption Standard coffin on January 19,
1999. Distributed.Net and the EFF DES Cracker together won DES Challenge III in 22 hours and
15 minutes. Over 100,000 computers connected through the Internet and EFF’s machine
were testing 245 billion keys per second when the key was found. The decrypted message
foreshadowed a new standard: “See you in Rome (second AES Conference), March 22-23,
1999.”
2. Introduction
2.1 Advanced Encryption Standard
After DES was shown to be vulnerable to the brute force attack, the need for a
new standard became unquestionable. There already exists an ANSI encryption standard,
3DES [3], which offers higher security than DES [16], but it is highly inefficient,
especially in software implementations. DES was primarily designed for hardware
implementations in the technology available at the time. Nevertheless, current demands for higher
bandwidths in both computer and telecommunication networks are becoming difficult to
satisfy with 3DES encryption devices, especially when feedback modes of operation are
considered. It was shown in [32] that DES implemented in a VirtexE-8 FPGA in
non-feedback mode can achieve a throughput of 12 Gbps. This would translate to 4 Gbps
for 3DES, which applies DES three times to each block. Po Khuon has demonstrated a 3DES
implementation in a Virtex-6 FPGA capable of handling a throughput of 59 Mbps in feedback
mode [19] and 7 Gbps in non-feedback mode for a deeply pipelined design [9]. My recent
research led to an implementation of 3DES in a Virtex-6 FPGA that achieves a throughput of
116 Mbps in feedback mode. Of course, ASIC devices can satisfy higher throughput demands.
One of the reported implementations of 3DES in an older 0.6 micron CMOS technology is
capable of encrypting data with a throughput of at least 155 Mbps [21]. Many current
computer and telecommunication networks require higher throughputs, in the range of
gigabits per second. I already participate in the design of hardware accelerators for
encryption algorithms used in a 1 Gbps IPSec implementation [8]. However, the next
generation of 10 Gbps LAN networks is being developed, and 10 Gbps encryption speed will
soon be required. Clearly, the 3DES algorithm can be a serious bottleneck in those applications.
The National Institute of Standards and Technology (NIST) has recognized the
need for a new standard and initiated the process of developing an Advanced Encryption
Standard [2]. NIST’s main objective was to develop an algorithm that offers
security at least equal to that of 3DES and is significantly more efficient in software and hardware
implementations on a variety of platforms. The algorithm should be capable of protecting
sensitive government information well into the 21st century.
2.1.1 Requirements and evaluation criteria
NIST published a formal call for candidate algorithms in 1997. The minimum
acceptable capabilities were:
1. The algorithm must implement symmetric (secret) key cryptography.
2. The algorithm must be a block cipher.
3. The candidate algorithm shall be capable of supporting key-block combinations with
sizes of 128-128, 192-128, and 256-128 bits.
In addition to the above list, all submissions should include:
• A complete written specification of the algorithm, consisting of all necessary
mathematical equations, tables, diagrams, and parameters needed to implement
the algorithm,
• A statement of the algorithm’s estimated computational efficiency in hardware
and software. Submitters were required to at least provide estimates for “NIST
AES analysis platform” and for 8-bit processors,
• A set of test vectors allowing verification of correctness of all implementations,
• A statement of the expected strength of the algorithm along with any supporting
rationale,
• Analyses of the algorithm with respect to known attacks. All known weak keys,
equivalent keys, complementation properties, restrictions on key selection, and
similar features of the algorithm should also be noted,
• Optimized and reference source code in ANSI C and Java describing the
algorithm,
• Declarations of granting full rights to patents covering the algorithm when and if
it should be chosen as a federal standard.
It was a remarkable change in the government’s approach to the security issue.
The previous government standard, DES [16], had been developed in close cooperation
with the National Security Agency (NSA). NSA concealed the design criteria and
justifications, which resulted in a lack of trust in the standard. This time NIST organized
the entire process in the form of a contest. Anybody could submit an algorithm.
Submitters were obliged to reveal all information about their algorithms and justify all
design decisions. The entire cryptographic community evaluated all algorithms openly.
The organization of the AES selection had several important advantages in that it:
• Focused the effort of the cryptographic community on one task, which was
essential given the small number of specialists in unclassified
research,
• Stimulated research on methods of constructing secure ciphers,
• Avoided backdoor theories, and,
• Sped up the acceptance of the standard.
All algorithms were evaluated with respect to three categories of criteria:
1. SECURITY - the most important factor in the evaluation
• Actual security offered by the algorithm,
• Extent to which the algorithm output is indistinguishable from a random
permutation on the input block,
• Soundness of the mathematical basis for the algorithm’s security,
• Other security factors, for example attacks which demonstrate that the actual
security of the algorithm is less than the strength claimed by the submitter.
2. COST
• Licensing requirements,
• Computational efficiency – speed of the algorithm in hardware and software,
• Memory requirements – in the case of software implementations, code size and
RAM requirements are major factors. In the case of hardware implementations,
gate count will be taken into account.
3. ALGORITHM AND IMPLEMENTATION CHARACTERISTICS
• Flexibility – the ability of the algorithm to be implemented on different
platforms for various applications,
• Hardware and software suitability – the algorithm should not be restricted to
hardware or software implementations only,
• Simplicity – simplicity of design and ease of implementation.
2.1.2 Evaluation process
The process of evaluating candidate algorithms was divided into two rounds.
The first round was intended to focus on the evaluation of algorithms based on the
cryptanalysis performed by the public as well as the efficiency of software implementations
on a variety of platforms. The AES contest attracted 15 submissions of block ciphers
from four continents and 12 countries, as shown in Table 2.1-1. Most of the algorithms
came from outside of the USA, demonstrating the large interest of the broad
cryptographic community in the development of the U.S. government encryption
standard.
Table 2.1-1 Fifteen candidate algorithms.
Continent       Country              Cipher
North America   Canada               CAST-256, Deal
                USA                  Mars, RC6, Twofish, Safer+, HPC
                Costa Rica           Frog
Europe          Germany              Magenta
                Belgium              Rijndael
                France               DFC
                Israel, UK, Norway   Serpent
Asia            Korea                Crypton
                Japan                E2
Australia       Australia            LOKI97
Only five algorithms passed to the second round of the evaluation: Mars [4], RC6
[28], Rijndael [11], Serpent [1], and Twofish [30]. All of the final candidates proved to
be sufficiently secure according to the best knowledge available during their analysis. Of
course, nobody can claim with certainty that a design is invulnerable to future cryptanalysis
methods. At best, only an estimate based on the current state of the art in cryptanalysis can
be made. One of the ways of assessing the security of symmetric-key ciphers is based on
differential [5] and linear [23] cryptanalysis. For both methods, one can find a
minimal number of rounds that makes the attack less practical than a brute-force
search. Any number of rounds greater than this minimum is believed to create a security
margin, a type of assurance by the designers themselves against future attacks. Table 2.1-2
summarizes the security features of the five final candidates. Obviously, ciphers with greater
security margins pay a price in the speed of operation, since their numbers of rounds are
greater than necessary.
Table 2.1-2 Security margins of final AES candidate algorithms.
Algorithm | Number of rounds | Minimum number of rounds believed to be secure | Security margin | Number of rounds of the best actual or estimated attack
where '+' represents an addition modulo 2, i.e., an XOR operation.
As a result, each bit of a product B can be represented as an XOR function of at most
three variable input bits, e.g., b7 = a7 + a6, b4 = a4 + a3 + a7, etc.
Each byte of the result of a matrix multiplication is an XOR of four bytes
representing the Galois field product of a byte A0, A1, A2, or A3 and a respective constant.
The entire MixColumn transformation can be performed using two layers of XOR gates,
with up to 3-input gates in the first layer, and 4-input gates in the second layer. In Virtex
FPGAs, each of these XOR operations requires only one lookup table (i.e., half of a
CLB Slice).
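The InvMixColumn transformation can be expressed as the following matrix
multiplication in GF(2^8):
\[
\begin{bmatrix} B_3 \\ B_2 \\ B_1 \\ B_0 \end{bmatrix}
=
\begin{bmatrix}
\mathtt{0E} & \mathtt{09} & \mathtt{0D} & \mathtt{0B} \\
\mathtt{0B} & \mathtt{0E} & \mathtt{09} & \mathtt{0D} \\
\mathtt{0D} & \mathtt{0B} & \mathtt{0E} & \mathtt{09} \\
\mathtt{09} & \mathtt{0D} & \mathtt{0B} & \mathtt{0E}
\end{bmatrix}
\begin{bmatrix} A_3 \\ A_2 \\ A_1 \\ A_0 \end{bmatrix}
\]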
The primary difference compared to MixColumn is the larger hexadecimal
values of the matrix coefficients. Multiplication by these constant elements of the Galois
field leads to a more complex dependence between the bits of a variable input and the
bits of the respective product. For example, the multiplication A = '0E' · B leads to the
following dependence between the bits of A and B:
a7 = b6 + b5 + b4
a6 = b5 + b4 + b3 + b7
a5 = b4 + b3 + b2 + b6
a4 = b3 + b2 + b1 + b5
a3 = b2 + b1 + b0 + b6 + b5
a2 = b1 + b0 + b6
a1 = b0 + b5
a0 = b7 + b6 + b5
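The bit equations above can be checked mechanically. The following is a minimal sketch, assuming the standard Rijndael reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B); the helper names `gf_mul` and `xor_dependence` are illustrative and do not come from the thesis. Because multiplication by a constant is linear over GF(2), an output bit depends on input bit i exactly when the product of the constant with the single-bit value 2^i has that output bit set.

```python
# Sketch only: derive the XOR dependence between input and output bits
# of a constant multiplication in GF(2^8).
RIJNDAEL_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1 (assumed reduction polynomial)

def gf_mul(a, b, poly=RIJNDAEL_POLY):
    """Multiply two bytes as polynomials over GF(2), reduced modulo poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:      # degree reached 8: reduce
            a ^= poly
        b >>= 1
    return result

def xor_dependence(constant):
    """Map each output bit index to the list of input bit indices XORed into it."""
    deps = {}
    for out_bit in range(8):
        deps[out_bit] = [in_bit for in_bit in range(7, -1, -1)
                         if (gf_mul(constant, 1 << in_bit) >> out_bit) & 1]
    return deps

if __name__ == "__main__":
    # Print the equations for the constant '0E' in the notation used above.
    for out_bit in range(7, -1, -1):
        terms = " + ".join(f"b{i}" for i in xor_dependence(0x0E)[out_bit])
        print(f"a{out_bit} = {terms}")
```

Calling `xor_dependence` with the other InvMixColumn constants ('09', '0B', '0D') gives their dependence lists in the same way, and the longest list for a given constant bounds the XOR fan-in needed in the first layer for that constant.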
The entire InvMixColumn transformation can be performed using two layers of
XOR gates, with up to 6-input gates in the first layer, and 4-input gates in the second
layer. In Virtex FPGAs, an implementation of a 6-input XOR operation requires two
layers of CLB Slices. As a result, the InvMixColumn transformation has a significantly
longer critical path compared to the MixColumn transformation, and the entire decryption
is more time consuming than encryption.
Taking into account all properties of the component operations, I have
implemented Rijndael in the structure shown in Figure 6.3-3.
[Diagram: encryption and decryption datapaths of a single round. Block labels: inverse element in Galois field, affine transformation, inverse affine transformation, ShiftRow, MixColumn, InvShiftRow, InvMixColumn, subkey.]
Figure 6.3-3 Structure of the implementation of a single round of Rijndael.
6.3.2 Results of the implementation of Rijndael
Throughput and area in basic architecture
The implementation of Rijndael in a basic iterative architecture has taken 2,507
CLB Slices, very close to the size of MARS. The maximum clock frequency
indicated by the static analyzer, however, was 32.3 MHz. Together with the small number of rounds,
this puts Rijndael in a high position in the AES ranking, with a throughput of 413.4 Mbps.
This result is much better than those for MARS and RC6.
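The relation between clock frequency and throughput in the basic iterative architecture can be sketched as follows. The sketch assumes a 128-bit block and one cipher round per clock cycle, so that Rijndael needs 10 clock cycles per block; the cycle count and the function name are assumptions made for illustration, not values quoted in this section.

```python
# Sketch only: throughput of an iterative (non-pipelined) block cipher circuit.
def iterative_throughput_mbps(block_bits, clock_mhz, cycles_per_block):
    """One block leaves the circuit every cycles_per_block clock cycles."""
    return block_bits * clock_mhz / cycles_per_block

# Rijndael: 128-bit block, 32.3 MHz clock, assumed 10 cycles per block.
rijndael_mbps = iterative_throughput_mbps(128, 32.3, 10)
print(f"Rijndael, basic architecture: {rijndael_mbps:.1f} Mbps")
```

Under these assumptions the result agrees with the 413.4 Mbps reported above.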
Throughput and area in mixed architecture
Implementing Rijndael in a pipelined architecture was more challenging than it may
at first seem. The main difficulty lies in pipelining the S-boxes, which one would normally
leave to the synthesis tool for decomposition. Unfortunately, our synthesizer does not
insert pipeline stages automatically. I could buy a special core for distributed memory,
which allows such optimization, or do the decomposition manually, but neither solution
seemed to guarantee good performance. For this reason I have decided to use
Block SelectRAMs to implement S-boxes.
I have introduced 7 pipeline stages into a single round, which gives a total of 70
stages for the full cipher. The amount of area required by the implementation was in the
range of 12,600 CLB Slices + 80 Block SelectRAMs. I could run the circuit with a 95
MHz clock, and this result gives a throughput of 12.1 Gbps.
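For the fully pipelined circuit, the arithmetic can be sketched as below. The sketch assumes that one 128-bit block enters and leaves the pipeline on every clock cycle once the pipeline is full, and that Rijndael uses 10 rounds; the function names are illustrative.

```python
# Sketch only: throughput and fill latency of a fully pipelined cipher.
def pipelined_throughput_gbps(block_bits, clock_mhz):
    """One block completes per clock cycle when the pipeline is full."""
    return block_bits * clock_mhz / 1000.0

def pipeline_latency_ns(stages_per_round, rounds, clock_mhz):
    """Time for a single block to traverse every pipeline stage."""
    total_stages = stages_per_round * rounds
    return total_stages / clock_mhz * 1000.0

print(pipelined_throughput_gbps(128, 95))         # about 12.16 Gbps
print(round(pipeline_latency_ns(7, 10, 95)))      # about 737 ns for 70 stages
```

The computed throughput of about 12.16 Gbps is consistent, to within rounding, with the value quoted above.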
6.4 Serpent
6.4.1 Structure and components of Serpent
Serpent is a block cipher developed in international cooperation between R.
Anderson, E. Biham and L. Knudsen [1]. All the submitters are very well known
cryptanalysts. The authors emphasize that their design philosophy was highly conservative,
and therefore only well-studied and well-understood operations are used. Taking into account the
reputation of the submitters, it is not surprising that Serpent has the largest security
margin among all candidates. Serpent belongs to the class of SP-network ciphers. It
consists of 32 small and simple rounds. Figure 6.4-1 shows one round of Serpent. The last
round is slightly different, but does not impose any significant constraints on the design.
[Diagram: input, subkey, S-boxes, linear transformation, output.]
Figure 6.4-1 Single round of Serpent.
Unfortunately, not all rounds are identical. The cipher employs eight different sets
of 4x4 S-boxes that repeat every eight rounds. Additionally, encryption and decryption
consist of different operations and we cannot take any advantage of resource sharing
between encryption and decryption.
Serpent can still be implemented in a basic architecture evaluating one round per
clock cycle, but it requires the implementation of some switching circuit selecting S-
boxes, as shown in Figure 6.4-3, and turns out to be very inefficient. I have made an
exception for Serpent and have unrolled eight rounds treating this configuration as a basic
architecture, as shown in Figure 6.4-2. I call this architecture Serpent I8.
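The effect of unrolling eight rounds can be illustrated with a short sketch. It assumes that round r (counting from 0) uses S-box set r mod 8, which is one way to read the statement that the eight sets repeat every eight rounds; the function names are illustrative.

```python
# Sketch only: which S-box set each round needs, grouped by clock cycle.
ROUNDS = 32      # number of Serpent rounds
SBOX_SETS = 8    # number of different S-box sets

def sbox_index(round_number):
    """S-box set used in a given round (assumed schedule: round mod 8)."""
    return round_number % SBOX_SETS

def sboxes_per_cycle(rounds_per_cycle):
    """List, for every clock cycle, the S-box sets evaluated in that cycle."""
    return [[sbox_index(r) for r in range(start, start + rounds_per_cycle)]
            for start in range(0, ROUNDS, rounds_per_cycle)]

# Serpent I8: every clock cycle evaluates sets 0..7 in the same fixed order,
# so no switching circuit is needed and a block takes 4 clock cycles.
print(sboxes_per_cycle(8))
# One round per cycle: the S-box set changes from cycle to cycle,
# so the hardware needs a selector among the eight sets.
print(sboxes_per_cycle(1)[:10])
```

With eight rounds per clock cycle every cycle looks identical, which is why the unrolled configuration avoids the S-box selection logic.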
[Diagram: a) encryption: input, 128-bit register, eight unrolled rounds built from 32 copies of S-box 0 through S-box 7, each followed by a linear transformation with included subkey (K0, K1, ..., K7), final subkey K32, output; b) decryption: input, 128-bit register, inverse linear transformations and 32 copies of inverse S-box 7 through inverse S-box 0 with subkeys (K7, ..., K1, K0), subkey K32, output. All datapaths are 128 bits wide.]
Figure 6.4-2 Implementation of Serpent I8 in basic architecture. a)
encryption, b) decryption.
As I have mentioned, the S-boxes accept only 4 inputs, and therefore match
the structure of an FPGA exceptionally well. Moreover, the linear transformation consists
only of two levels of XORs, which can be implemented very efficiently in LUTs. The
same observations apply to the decryption circuit. Serpent matches the internal architecture
of an FPGA so well that it is hard to believe that its designers are mathematicians with no
hardware design experience. The implementation of Serpent took us the least amount of
time.
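The match between 4-input S-boxes and 4-input lookup tables can be made concrete with a small sketch. Each output bit of a 4x4 S-box is a Boolean function of the same four input bits, so it can be stored as a 16-bit truth table, one table per output bit. The permutation `EXAMPLE_SBOX` below is only an example 4-bit permutation chosen for illustration, and the function names are hypothetical.

```python
# Sketch only: split a 4x4 S-box into four 16-entry truth tables,
# one per output bit, as they could be stored in 4-input LUTs.
EXAMPLE_SBOX = [3, 8, 15, 1, 10, 6, 5, 11, 14, 13, 4, 2, 7, 0, 9, 12]

def lut_truth_tables(sbox):
    """Return four 16-bit integers; bit x of table k is output bit k of sbox[x]."""
    tables = []
    for out_bit in range(4):
        table = 0
        for x, y in enumerate(sbox):
            table |= ((y >> out_bit) & 1) << x
        tables.append(table)
    return tables

def evaluate(tables, x):
    """Recombine the four single-bit lookups into the 4-bit S-box output."""
    return sum(((t >> x) & 1) << k for k, t in enumerate(tables))

tables = lut_truth_tables(EXAMPLE_SBOX)
print([f"{t:04X}" for t in tables])
```

Under this mapping one S-box occupies four lookup tables, so a layer of 32 S-boxes would occupy 128 of them before any optimization by the tools.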
Some of the research groups have implemented Serpent based on only one round
and switching S-boxes as shown in Figure 6.4-3. We refer to this architecture as Serpent
I1.
[Diagram: 128-bit register, subkey Ki, regular Serpent round with 32 x S-box 0, 32 x S-box 1, ..., 32 x S-box 7 feeding an 8-to-1 128-bit multiplexer, linear transformation, subkey K32, output. All datapaths are 128 bits wide.]
Figure 6.4-3 Serpent I1.
6.4.2 Results of the implementation of Serpent
Throughput and area in basic architecture
The implementation of Serpent in the basic iterative architecture, as shown in
Figure 6.4-2, has taken 4,507 CLB Slices and is the largest circuit, but we have to
keep in mind that eight rounds have been unrolled. The maximum clock frequency
indicated by the static analyzer was 13.5 MHz, which, combined with only four
clock cycles per block, gives a throughput of 431 Mbps. This result outperforms all
other ciphers.
Throughput and area in mixed architecture
Applying pipelining to Serpent was a very easy task because this cipher has a very
FPGA-friendly structure. We have introduced only three pipeline stages per round,
which gives a total of 96 pipeline stages for the entire implementation. The circuit takes
approximately 19,700 CLB Slices, which indicates a very small area increase associated
with introducing registers. We could run this circuit at a clock frequency of 130.9 MHz,
which gives a high throughput of 16.7 Gbps. This is the best result achieved by any
cipher reported in the literature.
6.5 Twofish
6.5.1 Structure and components of Twofish
Twofish was submitted to the AES contest by a team from Counterpane Systems
[29] led by B. Schneier, a well known cryptanalyst. It follows the
classical Feistel-network structure almost perfectly, and performing encryption and decryption in the same
circuit requires introducing only a very small amount of switching logic. The entire
structure of the cipher is shown in Figure 6.5-1.
The designers of Twofish have introduced a new idea in cipher design: the use of
key-dependent S-boxes. Unlike in other ciphers using fixed S-boxes, the contents of
key-dependent S-boxes change for every key, making cryptanalysis certainly much harder.
The ideal way of implementing those 8x8 S-boxes would be to express them as
memories, which could be filled with new contents every time the keys are changed. This
could be done using Block SelectRAMs in a Virtex FPGA. Four such RAM blocks would
be sufficient to implement all eight S-boxes. This solution is acceptable only in the case
of the basic iterative architecture, when we do not need to change keys on the fly. In the
case of the pipelined architectures, changing the contents of memory in one clock cycle is
not feasible unless we make use of several memory modules and switch among them.
I have chosen not to use this technique, and I have implemented the algorithm that
computes the contents of the S-boxes inside the cipher round.
[Diagram: plaintext P (128 bits), input whitening with K0 to K3, round function F built from two g functions (each S-box 0 to S-box 3 followed by MDS), PHT, round subkeys K2r+8 and K2r+9, rotations <<< 8, <<< 1 and >>> 1, round repeated 16 times, output whitening with K4 to K7, ciphertext C (128 bits).]
Figure 6.5-1 High-level structure of the Twofish cipher.
Each S-box consists of three permutations interleaved with keys S0 and S1, as
shown in Figure 6.5-2. Each q-permutation can be efficiently implemented in LUTs, as it
consists of small 4x4 t-boxes, shown in Figure 6.5-3, which match the internal
architecture of an FPGA very well.
[Diagram: S-box 0 to S-box 3, each built from three stages of q0 and q1 permutations interleaved with keys S0 and S1.]
Figure 6.5-2 S-boxes in Twofish.
[Diagram: 8-bit input split into 4-bit halves a and b; t-boxes t0 and t1; rotation >>>1 and the term a(0), 0, 0, 0; intermediate values a' and b'; t-boxes t2 and t3; rotation >>>1 and the term a'(0), 0, 0, 0; 8-bit output.]
Figure 6.5-3 Permutation q.
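Another function used in Twofish is a 4-by-4-byte MDS matrix. The
transformation performed by this matrix is described by the formula:
\[
\begin{bmatrix} z_3 \\ z_2 \\ z_1 \\ z_0 \end{bmatrix}
=
\begin{bmatrix}
\mathtt{5B} & \mathtt{EF} & \mathtt{01} & \mathtt{EF} \\
\mathtt{EF} & \mathtt{01} & \mathtt{5B} & \mathtt{EF} \\
\mathtt{01} & \mathtt{EF} & \mathtt{EF} & \mathtt{5B} \\
\mathtt{5B} & \mathtt{5B} & \mathtt{EF} & \mathtt{01}
\end{bmatrix}
\begin{bmatrix} y_3 \\ y_2 \\ y_1 \\ y_0 \end{bmatrix}
\]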
where y3...y0 are consecutive bytes of the input 32-bit word (y3 is the most significant
byte), and z3...z0 form the output word. This matrix multiplies a 32-bit input value by
8-bit constants, with all multiplications performed (byte by byte) in the Galois field
GF(2^8). The primitive polynomial is x^8 + x^6 + x^5 + x^3 + 1. Only three different
multiplications are effectively used in the MDS matrix, namely multiplication
• by 5B_16 = 0101 1011_2 (represented in GF(2^8) by the polynomial x^6 + x^4 + x^3 + x +
1),
• by EF_16 = 1110 1111_2 (x^7 + x^6 + x^5 + x^3 + x^2 + x + 1), and
• by 01_16 = 0000 0001_2 (the equivalent element in GF(2^8) is just 1) - obviously the
result is equal to the input value.
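The MDS transformation described above can be sketched in a few lines. The sketch assumes the matrix rows produce z3, z2, z1, z0 (top to bottom) from y3, y2, y1, y0, as in the formula, and uses the polynomial x^8 + x^6 + x^5 + x^3 + 1 (0x169) for reduction; the names `gf_mul` and `mds_multiply` are illustrative.

```python
# Sketch only: Twofish MDS matrix multiplication over GF(2^8).
MDS_POLY = 0x169  # x^8 + x^6 + x^5 + x^3 + 1

# Rows produce z3, z2, z1, z0; columns multiply y3, y2, y1, y0.
MDS = [
    [0x5B, 0xEF, 0x01, 0xEF],
    [0xEF, 0x01, 0x5B, 0xEF],
    [0x01, 0xEF, 0xEF, 0x5B],
    [0x5B, 0x5B, 0xEF, 0x01],
]

def gf_mul(a, b, poly=MDS_POLY):
    """Multiply two bytes as polynomials over GF(2), reduced modulo poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def mds_multiply(word):
    """Apply the MDS matrix to a 32-bit word whose most significant byte is y3."""
    y = [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]   # y3, y2, y1, y0
    z = []
    for row in MDS:
        acc = 0
        for constant, byte in zip(row, y):
            acc ^= gf_mul(constant, byte)
        z.append(acc)
    return (z[0] << 24) | (z[1] << 16) | (z[2] << 8) | z[3]

print(f"{mds_multiply(0x00000001):08X}")  # selects the column that multiplies y0
```

In hardware each of the constant multiplications reduces to a fixed network of XOR gates, in the same way as for the Rijndael MixColumn constants.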
Finally, the PHT transform is a simple function that consists of two additions modulo 2^32,
as shown in Figure 6.5-4. Both additions are de facto independent and can be performed
simultaneously.
[Diagram: inputs a and b, two adders, a left shift by one (<<1), outputs a' and b'.]
Figure 6.5-4 PHT transformation.
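A minimal sketch of the PHT follows. It assumes the usual definition a' = a + b and b' = a + 2b (both modulo 2^32), which is consistent with the two adders and the single left shift in Figure 6.5-4; the function name is illustrative.

```python
# Sketch only: pseudo-Hadamard transform on two 32-bit words.
MASK32 = 0xFFFFFFFF

def pht(a, b):
    """Return (a', b') with a' = a + b and b' = a + 2b, both modulo 2^32."""
    a_out = (a + b) & MASK32
    b_out = (a + ((b << 1) & MASK32)) & MASK32
    return a_out, b_out

print(pht(1, 2))            # (3, 5)
print(pht(0xFFFFFFFF, 1))   # both sums wrap around modulo 2^32
```

Neither sum uses the result of the other, which is why the two adders can operate in parallel.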
As I have mentioned at the beginning of this section, both encryption and
decryption transformations can be implemented within the same circuit with a small
amount of additional logic. Figure 6.5-5 shows the structure of an implementation of a
single round used in my design.
[Diagram: 128-bit register, F-function, rotations <<<1 and >>>1 selectable for encryption and decryption.]
Figure 6.5-5 Implementation of a single round of Twofish.
6.5.2 Results of the implementation of Twofish
Throughput and area in basic architecture
Twofish matches the structure of an FPGA very well, which results in a compact
design. Its implementation has taken 1,076 CLB Slices. The maximum clock frequency
indicated by the static analyzer was 22.1 MHz, and this translates to a throughput of 177
Mbps.
Throughput and area in mixed architecture
Twofish has quite a long critical path through its round, and there is a lot of
room for pipeline stages. I have introduced as many registers as I could, and this resulted
in a very deep pipeline with 24 stages per round. Hence, the total number of stages for the
full cipher is 384. The area of the circuit was in the range of 21,000 CLB Slices, and it
could be run with a clock frequency of 119 MHz. This gives a high throughput of 15.2
Gbps. As we can see, the number of introduced pipeline stages proved to be too large, as the
gain in clock frequency was only a factor of about five. As with RC6, we could most
likely obtain similar performance with fewer than ten pipeline stages.
Analysis of the results
7.1 Comparison of ciphers in feedback modes
The results of implementing AES candidates, according to the assumptions and
design procedure summarized in chapter 5, are shown in Figure 7.1-1 and Figure 7.1-2.
[Bar chart. Throughput in Mbps: Serpent 431, Rijndael 414, Twofish 177, RC6 142, Mars 61, 3DES 59.]
Figure 7.1-1 Throughput for Virtex XCV-1000, my results.
All implementations were based on Virtex XCV-1000BG560-6, one of the largest
currently available Xilinx Virtex devices. Additionally I have implemented a current
ANSI standard [3], Triple DES, which I used as a reference for comparison.
Implementations of all ciphers took from 9% (for Twofish) to 37% (for Serpent
I8) of the total number of 12,288 CLB slices available in the Virtex device used in my
designs. It means that less expensive Virtex devices could be used for all
implementations. Additionally, the key scheduling unit could be easily implemented
within the same device as the encryption/decryption unit.
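The utilization figures quoted above follow directly from the slice counts. The short sketch below recomputes them, using the slice counts for Twofish and Serpent I8 from Chapter 6 and the 12,288 CLB slices of the device; the function name is illustrative.

```python
# Sketch only: fraction of the XCV-1000 occupied by each design.
XCV1000_SLICES = 12288

def utilization_percent(slices_used, slices_available=XCV1000_SLICES):
    """Share of the device's CLB slices taken by a design, in percent."""
    return 100.0 * slices_used / slices_available

print(round(utilization_percent(1076)))   # Twofish
print(round(utilization_percent(4507)))   # Serpent I8
```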
[Bar chart. Area in CLB Slices: Serpent 4507, Rijndael 2507, Twofish 1076, RC6 1137, Mars 2744, 3DES 356.]
Figure 7.1-2 Area for Virtex XCV-1000, my results.
In Figure 7.1-3 and Figure 7.1-4, I compare my results with the results of research
groups from Worcester Polytechnic Institute [15] and the University of Southern California
[12]. Both groups used identical FPGA devices, the same design tools, and a similar design
procedure. The order of the AES algorithms in terms of the encryption and decryption
throughput is identical in the reports of all research groups. Serpent in architecture I8 (see
Figure 6.4-2) and Rijndael are over twice as fast as the remaining candidates. Twofish and
RC6 offer medium throughput. Mars is consistently the slowest of all candidates.
Interestingly, all candidates, including Mars, are faster than Triple DES. Serpent I8 (see
Figure 6.4-2) is significantly faster than Serpent I1 (Figure 6.4-3), and this architecture
should clearly be used in cipher feedback modes whenever speed is a primary
concern and the area limit is not exceeded.
[Bar chart. Throughput in Mbps (axis 0 to 500) for Serpent I8, Rijndael, Twofish, RC6, Mars, and Serpent I1; three series: Worcester Polytechnic Institute, University of Southern California, and our results.]
Figure 7.1-3 Throughput for Virtex XCV-1000, comparison with results
of other groups.
The agreement among circuit areas obtained by different research groups is not as
good as for the circuit throughputs, as shown in Figure 7.1-4. These differences can be
explained by the fact that speed was the primary optimization criterion for all
involved groups, and area was treated only as a secondary parameter. Additional
differences resulted from different assumptions regarding sharing resources between
encryption and decryption, key storage, and the use of dedicated memory blocks.
[Bar chart. Area in CLB slices (axis 0 to 9000) for Serpent I8, Rijndael, Twofish, RC6, Mars, and Serpent I1; three series: Worcester Polytechnic Institute, University of Southern California, and our results.]
Figure 7.1-4 Area for Virtex XCV-1000, comparison with results of other
groups.
Despite these different assumptions, the analysis of the results presented in Figure
7.1-4 leads to relatively consistent conclusions. All ciphers can be divided into three
major groups:
1. Twofish and RC6 require the smallest amount of area,
2. Rijndael and Mars require a medium amount of area (at least 50% more than
Twofish and RC6),
3. Serpent I8 requires the largest amount of area (at least 60% more than
Rijndael and Mars). Serpent I1 belongs to the first group according to [12],
and to the second group according to [15].
The overall features of all AES candidates can be best presented using a two-
dimensional diagram showing the relationship between the encryption/decryption
throughput and the circuit area. I collected my results for the Xilinx Virtex FPGA
implementations in Figure 7.1-5. For comparison I show the results obtained by the NSA
group for ASIC implementations [33] in Figure 7.1-6.
[Scatter plot. Vertical axis: throughput in Mbps (0 to 500); horizontal axis: area in CLB slices (0 to 5000); points: Rijndael, Serpent I8, Mars, RC6, Twofish, Serpent I1.]
Figure 7.1-5 Throughput vs. area for Virtex-1000, our results. The result
for Serpent I1 based on [12].
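The throughput/area relationship can also be summarized numerically. The sketch below computes the ratio in Mbps per CLB slice from the values reported in Figure 7.1-1 and Figure 7.1-2; the assignment of the area values to RC6 and Mars is read from the chart labels and should be treated as an assumption, and the variable names are illustrative.

```python
# Sketch only: throughput/area ratio in the basic iterative architecture.
# (throughput in Mbps, area in CLB slices), as read from Figures 7.1-1 and 7.1-2.
RESULTS = {
    "Serpent I8": (431, 4507),
    "Rijndael":   (414, 2507),
    "Twofish":    (177, 1076),
    "RC6":        (142, 1137),   # area assignment assumed from the chart labels
    "Mars":       (61, 2744),    # area assignment assumed from the chart labels
}

def throughput_per_slice(results):
    """Return {cipher: Mbps per CLB slice}."""
    return {name: mbps / slices for name, (mbps, slices) in results.items()}

ratios = throughput_per_slice(RESULTS)
ranking = sorted(ratios, key=ratios.get, reverse=True)
for name in ranking:
    print(f"{name:10s} {ratios[name]:.3f} Mbps/slice")
```

With these inputs Rijndael and Twofish have nearly equal ratios, with Rijndael marginally ahead, and Mars has by far the lowest ratio.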
Comparing diagrams shown in Figure 7.1-5 and Figure 7.1-6 reveals that the
throughput/area characteristic of the AES candidates is almost identical for the FPGA and
ASIC implementations. The primary difference between the two diagrams comes from
the absence of the ASIC implementation of Serpent I8 in the NSA report [33].
[Scatter plot. Vertical axis: throughput in Mbps (0 to 700); horizontal axis: area in mm^2 (0 to 40); points: Serpent I1, RC6, Twofish, Mars, Rijndael.]
Figure 7.1-6 Throughput vs. area for 0.5 µm CMOS standard-cell ASICs,
NSA result.
All ciphers can be divided into three distinct groups:
• Rijndael and Serpent I8 offer the highest speed at the expense of the
relatively large area;
• Twofish, RC6, and Serpent I1 offer medium speed combined with a very
small area;
• Mars is the slowest of all AES candidates and second to last in terms of the
circuit area.
Looking at this diagram, one may ask which of the two parameters, speed or area,
should be weighted more heavily in the comparison. The definitive answer is speed. The
primary reason for this choice is that in feedback cipher modes it is not possible to
substantially increase encryption throughput even at the cost of a very substantial
increase in the circuit area. On the other hand, by using the resource sharing described in
section 3.3.2, the designer can substantially decrease circuit area at the cost of a
proportional (or higher) decrease in the encryption throughput. Therefore, Rijndael and
Serpent can be implemented using almost the same amount of area as Twofish and RC6,
but Twofish and RC6 can never reach the speeds of the fastest implementations of
Rijndael and Serpent I8.
7.2 Comparison of ciphers in non-feedback modes
The results of my implementations of four AES candidates using full mixed inner-
and outer-round pipelining and Virtex XCV-1000BG560-6 FPGA devices are
summarized in Figure 7.2-1, Figure 7.2-2 and Figure 7.2-3. Because of a lack of time, I
did not attempt to implement Mars in this architecture. In Figure 7.2-4, I provide the
results of the implementation of all five AES finalists by the NSA group using full outer-
round pipelining and semi-custom ASICs in a 0.5 µm CMOS MOSIS library [33].
[Bar chart. Throughput in Gbps: Serpent 16.8, Twofish 15.2, RC6 13.1, Rijndael 12.2.]
Figure 7.2-1 Throughput for mixed inner- and outer-round pipelining in
Virtex1000, my results.
To the best of my knowledge, the throughputs of the AES candidates obtained as a result
of my design effort, and shown in Figure 7.2-1, are the best ever reported, including both
FPGA and ASIC technologies.
[Bar chart. Area in CLB slices: Serpent 19,700; Twofish 21,000; RC6 46,900; Rijndael 12,600 plus 80 RAMs (dedicated memory blocks).]
Figure 7.2-2 Area for mixed inner- and outer-round pipelining on
Virtex1000, my results.
[Bar chart. Latency without and with pipelining (axis labeled in µs, 0 to 6; bar labels apparently in ns): Serpent I8 297 and 733 (x 2.5); Rijndael 309 and 737 (x 2.4); Twofish 722 and 3092 (x 4.3); RC6 897 and 5490 (x 6.1).]
Figure 7.2-3 Increase in the encryption/decryption latency as a result of
moving from the basic architecture to mixed inner- and outer-round
pipelining.
My designs outperform similar pipelined designs based on the use of identical
FPGA devices, reported in [15], by a factor ranging from 3.5 for Serpent to 9.6 for
Twofish. These differences may be attributed to the use of a sub-optimum number of inner-
round pipeline stages and to limiting designs to single-chip modules in [15]. My designs
outperform the NSA ASIC designs in terms of the encryption/decryption throughput by a
factor ranging from 2.1 for Serpent to 6.6 for Twofish (see Figure 7.2-1 and Figure
7.2-4). Since both groups obtained very similar values for throughputs in the basic
iterative architecture (see Figure 7.1-5 and Figure 7.1-6), these large differences should
be attributed primarily to the differences between the full mixed inner- and outer-round
architecture employed by me and the full outer-round architecture used by the
NSA team.
[Bar chart. Throughput in Gbps: Serpent 8.0, Rijndael 5.7, Twofish 2.3, RC6 2.2, Mars 2.2.]
Figure 7.2-4 Throughput for 0.5 µm CMOS standard-cell ASICs, NSA
results.
By comparing Figure 7.2-1 and Figure 7.2-4, it can be clearly seen that using full
outer-round pipelining for comparison of the AES candidates favors ciphers with less
complex cipher rounds. Twofish and RC6 are over two times slower than Rijndael and
Serpent I1 when full outer-round pipelining is used (Figure 7.2-4), and have a
throughput greater than Rijndael, and comparable to Serpent I1, when full mixed inner-
and outer-round pipelining is applied (Figure 7.2-1). Based on my basic iterative
architecture implementation of Mars, I predict that the choice of the pipelined
architecture would have a similar effect on Mars.
The deviations in the values of the AES candidate throughputs in full mixed
inner- and outer-round pipelining do not exceed 20% of their mean value. The analysis of
critical paths in my implementations has demonstrated that all critical paths contain only
a single level of CLBs and differ only in the delays of programmable interconnects. Taking
into account the already small spread of the AES candidate throughputs and the potential for
further optimizations, I conclude that the demonstrated differences in throughput are not
sufficient to favor any of the AES algorithms over the others. As a result, circuit area
should be the primary criterion of comparison for our architecture and non-feedback
cipher modes.
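The 20% bound on the spread can be verified with a few lines of arithmetic. The sketch below uses the four pipelined throughputs reported in Figure 7.2-1; the variable names are illustrative.

```python
# Sketch only: deviation of each pipelined throughput from the mean.
# Values in Gbps, as reported in Figure 7.2-1.
THROUGHPUT_GBPS = {"Serpent": 16.8, "Twofish": 15.2, "RC6": 13.1, "Rijndael": 12.2}

mean_gbps = sum(THROUGHPUT_GBPS.values()) / len(THROUGHPUT_GBPS)
deviation_percent = {name: 100.0 * (value - mean_gbps) / mean_gbps
                     for name, value in THROUGHPUT_GBPS.items()}

print(f"mean = {mean_gbps:.3f} Gbps")
for name, dev in deviation_percent.items():
    print(f"{name:9s} {dev:+.1f}%")
```

The largest deviation, for Serpent, is roughly 17% above the mean, which is within the stated bound.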
As shown in Figure 7.2-2, Serpent and Twofish require almost identical area for
their implementations based on full mixed inner- and outer-round pipelining. RC6
requires over twice as much area. Comparison of the area of Rijndael and
other ciphers is made difficult by the use of dedicated memory blocks, Block
SelectRAMs, to implement S-boxes. Block SelectRAMs are not used in implementations
of any of the remaining AES candidates, and I am not aware of any formula for
expressing the area of Block SelectRAMs in terms of the area used by CLB Slices.
Nevertheless, I have estimated that an equivalent implementation of Rijndael, composed
of CLB Slices only, would take approximately 24,600 CLB Slices, which is only 17 and 25 percent
more than the implementations of Twofish and Serpent, respectively.
Additionally, Serpent, Twofish, and Rijndael all can be implemented using two
FPGA devices XCV-1000, while RC6 requires four such devices. It should be noted that
in my designs, all implemented circuits perform both encryption and decryption. This is
in contrast with the designs reported in [15], where only encryption logic is implemented,
and therefore a fully pipelined implementation of Serpent can be included in one FPGA
device.
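The device counts for the ciphers implemented in CLB slices only can be estimated with a simple ceiling division, as sketched below. The sketch considers slice count alone and ignores routing and I/O constraints, which is a simplifying assumption; Rijndael is left out because its implementation also depends on Block SelectRAMs. The function name is illustrative.

```python
# Sketch only: minimum number of XCV-1000 devices by slice count alone.
import math

XCV1000_SLICES = 12288

def devices_needed(slices_required, slices_per_device=XCV1000_SLICES):
    """Smallest number of devices whose combined slices cover the design."""
    return math.ceil(slices_required / slices_per_device)

for name, slices in (("Serpent", 19700), ("Twofish", 21000), ("RC6", 46900)):
    print(name, devices_needed(slices))
```

Under this assumption Serpent and Twofish each need two devices and RC6 needs four, in agreement with the counts given above.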
Connecting two or more Virtex FPGA devices into a multi-chip module working
with the same clock frequency is possible because the FPGA system level clock can
achieve rates up to 200 MHz [34], and the highest internal clock frequency required by
the AES candidate implementation is 131 MHz for Serpent. New devices of the Virtex
family released in 2001 are capable of including full implementations of Serpent,
Twofish, and Rijndael on a single integrated circuit.
In Figure 7.2-3, I report the increase in the encryption/decryption latency resulting
from using inner-round pipelining with the number of stages that is optimum from the point
of view of the throughput/area ratio. In the majority of applications that require hardware-
based high-speed encryption, the encryption/decryption throughput is the primary
performance measure, and the latencies shown in Figure 7.2-3 are fully acceptable.
Therefore, in this type of application, the only parameter that truly differentiates AES
candidates working in non-feedback cipher modes is the area, and thus the cost, of
implementations. As a result, in non-feedback cipher modes, Serpent, Twofish, and
Rijndael offer very similar performance characteristics, while RC6 requires over twice as
much area and twice as many Virtex XCV-1000 FPGA devices.
Summary
I have implemented all five final AES candidates in the basic iterative
architecture, suitable for feedback cipher modes, using Xilinx Virtex XCV-1000 FPGA
devices. For all five ciphers, I have obtained the best throughput/area ratio, compared to
the results of other groups reported for FPGA devices. Additionally, I have implemented
four AES algorithms using full mixed inner- and outer-round pipelining suitable for
operation in non-feedback cipher modes. For all four ciphers, I have obtained throughputs
in excess of 12 Gbps, the highest throughputs ever reported in the literature for hardware
implementations of the AES candidates, taking into account both FPGA and ASIC
implementations.
I have developed the consistent methodology for the fast implementation and fair
comparison of the AES candidates in hardware. I have found out that the choice of an
optimum architecture and a fair performance measure is different for feedback and non-
feedback cipher modes.
For feedback cipher modes (CBC, CFB, OFB), the basic iterative architecture is
the most appropriate for comparison and future implementations. The
encryption/decryption throughput should be the primary criterion of comparison, because
using a different architecture, even at the cost of a substantial increase in the circuit area,
cannot easily increase it. Serpent and Rijndael outperform the three remaining AES
candidates by at least a factor of two in both throughput and latency. Two independent
research groups have confirmed my results for feedback modes.
For non-feedback cipher modes (ECB, counter mode), the architecture with full
mixed inner- and outer-round pipelining is the most appropriate for comparison and
future implementations. In this architecture, all AES candidates achieve high and
approximately equal throughput. As a result, the implementation area should be the
primary criterion of comparison. Implementations of Serpent, Twofish, and Rijndael
consume approximately the same amount of FPGA resources. RC6 requires more than twice
as much area. My approach to comparison of the AES candidates in non-feedback cipher
modes is new, and has yet to be followed, verified, and confirmed by other
research groups.
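The difference between the two families of cipher modes can be illustrated with a
simple clock-cycle count. The sketch below is not part of the measured results. It is a
minimal model under stated assumptions: one cipher round per clock cycle, one pipeline
stage per round, no key-setup or input/output overhead, and the same clock period in
both architectures. The function names and the example numbers (10 rounds, 1000 blocks)
are hypothetical and chosen only for illustration.

```python
def cycles_iterative(num_blocks: int, rounds: int) -> int:
    """Basic iterative architecture: one round per clock cycle,
    and only one block is processed at a time."""
    return num_blocks * rounds


def cycles_pipelined(num_blocks: int, stages: int, feedback: bool) -> int:
    """Pipelined architecture with `stages` pipeline stages in total.

    Assumption (hypothetical model): every stage takes one clock cycle.
    """
    if feedback:
        # Feedback modes (CBC, CFB, OFB): the input of block i depends on the
        # output of block i-1, so a new block can enter the pipeline only
        # after the previous block has passed through all stages.
        return num_blocks * stages
    # Non-feedback modes (ECB, counter mode): blocks are independent, so the
    # pipeline is filled once and then delivers one block per clock cycle.
    return stages + num_blocks - 1


if __name__ == "__main__":
    blocks, rounds = 1000, 10  # hypothetical example values
    print("iterative:", cycles_iterative(blocks, rounds))
    print("pipelined, feedback mode:", cycles_pipelined(blocks, rounds, True))
    print("pipelined, non-feedback mode:", cycles_pipelined(blocks, rounds, False))
```

Under these assumptions, pipelining brings no gain in feedback modes, which is why
throughput is the natural criterion there, while in non-feedback modes the cost
approaches one clock cycle per block for every cipher, which leaves area as the
differentiating parameter.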
My analysis leads to the following ranking of the AES candidates in terms of
hardware efficiency: Rijndael and Serpent close together in first place, followed in
order by Twofish, RC6, and MARS. Figure 7.1-5 clearly indicates that Rijndael offers high
throughput and the best throughput/area ratio in the basic iterative architecture. Figure
7.2-2 shows the area requirements for all ciphers implemented in the fully pipelined
architecture. None of those ciphers could fit within the device I used for comparison;
however, Xilinx Inc. has already introduced extended families of FPGA devices: Virtex-E
and Virtex-II. FPGAs in the Virtex-E family have a large amount of Block SelectRAM and a
number of CLB slices similar to that of the Virtex family. Using Virtex-E devices, I
could certainly implement most of the candidates within one chip. Only RC6 is too big
for the largest of the devices. Serpent
and Twofish could each be implemented in one of the largest chips, the Virtex-E XCV2600E.
However, Rijndael appears to have the smallest requirements, as it can be implemented
entirely within one Virtex-E XCV1600E. Again, Rijndael takes the lead in the comparison
of the ciphers.
When I came to the AES3 conference in New York with my advisor, we attended
the reception before the conference sessions. Talking with other participants, we had
the feeling that most of them expected an American candidate cipher to win the contest,
because the winner was going to become an American government standard. From this point
of view, MARS, RC6, and Twofish had an advantage over the remaining candidates. Rijndael
was proposed by two unknown researchers from Europe, which was not a good omen for its
acceptance. Serpent already had a bad reputation for being very slow in software. At the
AES3 conference we presented a paper [19], which focused on implementations of the AES
candidates in the basic iterative architecture. We showed my results, which are
summarized in Figure 7.1-1 and Figure 7.1-4. Other research groups presented similar
results, as shown in Figure 7.1-3. At the end of the AES3 conference all participants
were asked to fill out a survey, in which everyone could indicate his/her choice for the
AES standard. The results of the survey are presented in Figure 8-1.
Figure 8-1 Results of the survey filled out by participants of the AES3 conference
(bar chart; axis: number of votes, 0 to 100; bars: Serpent, Rijndael, Twofish, RC6, Mars).
The opinion voiced by AES3 participants is surprisingly well correlated with the
results of our research.
The winner of the contest was finally announced in October 2000. Rijndael
has become the AES, and will protect US government data well into the 21st century.
The AES contest is over, but the results of my research remain of interest. All the
finalists proved to be equally secure, and may find uses in real applications. I have
already encountered requests to include all the remaining candidate algorithms in secure
communication standards as optional algorithms. My research results may guide
hardware implementers of those algorithms.
Rijndael was officially approved as the AES in November 2001. It will
become a required algorithm for all of the most important secure communication
protocols, such as IPSec. I have started research focused on implementing the AES for
gigabit IPSec, and have already presented an implementation of Rijndael in the basic
iterative architecture, which achieves a throughput of 577 Mbps [8]. I am currently
working on meeting the 1 Gbps requirement for gigabit IPSec.
List of References
[1] R. Anderson, E. Biham, L. Knudsen, Serpent: A Proposal for the Advanced Encryption Standard, NIST AES Proposal, June 1998.
[2] Advanced Encryption Standard Development Effort, http://www.nist.gov/aes.
[3] ANSI X9.52, Triple Data Encryption Algorithm Modes of Operation, 1998.
[4] C. Burwick, D. Coppersmith, E. D’Avignon, R. Gennaro, S. Halevi, C. Jutla, S. Matyas, L. O’Connor, M. Peyravian, D. Safford, N. Zunic, MARS – a candidate cipher for AES, NIST AES Proposal, June 1998.
[5] E. Biham, A. Shamir, Differential cryptanalysis of DES-like cryptosystems, Technical report CS90-16, Weizmann Institute of Science, CRYPTO'90 & Journal of Cryptology, Vol. 4, No. 1, pp. 3-72, 1991.
[6] E. Biham, A. Shamir, Differential Cryptanalysis of the full 16-round DES, Advances in Cryptology, CRYPTO’92, 1992.
[7] E. Biham, A. Shamir, Differential Cryptanalysis of the Data Encryption Standard, Springer Verlag, 1993. ISBN: 0-387-97930-1, 3-540-97930-1.
[8] P. Chodowiec, K. Gaj, P. Bellows, B. Schott, Experimental Testing of the Gigabit IPSec-Compliant Implementations of Rijndael and Triple DES Using SLAAC-1V FPGA Accelerator Board, Proc. Information Security Conference, Malaga, Spain, October 1-3, 2001.
[9] P. Chodowiec, P. Khuon, K. Gaj, Fast Implementations of Secret-Key Block Ciphers Using Mixed Inner- and Outer-Round Pipelining, Proc. ACM/SIGDA Ninth International Symposium on Field Programmable Gate Arrays, FPGA’01, Monterey, February 2001, pp. 94-102.
[10] P. Chodowiec, W. Todryk, Hardware Encryptor for Hard Drives, Warsaw University of Technology, Faculty of Electronics and Information Technology, Senior Design Project, Warsaw, 1998.
[11] J. Daemen, V. Rijmen, AES Proposal: Rijndael, NIST AES Proposal, June 1998.
[12] A. Dandalis, V. Prasanna, J. Rolim, A Comparative Study of Performance of AES Final Candidates Using FPGAs, Proc. Cryptographic Hardware and Embedded Systems Workshop, CHES 2000, Worcester, MA, Aug. 17-18, 2000.
[13] Electronic Frontier Foundation and O’Reilly and Associates, Cracking DES: Secrets of Encryption Research, Wiretap Politics & Chip Design, July 1998.
[14] A. Elbirt, C. Paar, An FPGA Implementation and Performance Evaluation of the Serpent Block Cipher, Eighth ACM International Symposium on Field-Programmable Gate Arrays, Monterey, California, February 10-11, 2000.
[15] A. Elbirt, W. Yip, B. Chetwynd, C. Paar, An FPGA Implementation and Performance Evaluation of the AES Block Cipher Candidate Algorithm Finalists, Proc. 3rd Advanced Encryption Standard (AES) Candidate Conference, New York, April 13-14, 2000.
[16] Federal Information Processing Standards Publication 46-3, Data Encryption Standard, National Institute of Standards and Technology, 1999.
[17] Federal Information Processing Standards Publication 81, DES modes of operation, National Institute of Standards and Technology, 1980.
[18] Federal Information Processing Standards Publication 197, Advanced Encryption Standard (AES), National Institute of Standards and Technology, 2001.
[19] K. Gaj, P. Chodowiec, Comparison of the Hardware Performance of the AES Candidates Using Reconfigurable Hardware, Proc. 3rd Advanced Encryption Standard (AES) Candidate Conference, New York, April 13-14, 2000.
[20] J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, Second Edition, 1995. ISBN: 1-55960-329-8.
[21] H. Leitold, W. Mayerwieser, U. Payer, K. Posch, R. Posch, J. Wolkerstorfer, A 155 Mbps Triple-DES Network Encryptor, Proc. Cryptographic Hardware and Embedded Systems Workshop, CHES 2000.
[22] H. Lipmaa, P. Rogaway, D. Wagner, CTR-Mode Encryption, Comments to NIST concerning AES Modes of Operations, 2000.
[23] M. Matsui, Linear cryptanalysis method for DES cipher, Advances in Cryptology, EUROCRYPT’93, 1993.
[24] J. Nechvatal, E. Barker, D. Dodson, M. Dworkin, J. Foti, E. Roback, Status Report on the First Round of the Development of the Advanced Encryption Standard, NIST report, August 1999.
[25] M. Peattie, Use Triple DES for Ultimate Virtex-II Design Protection, Xcell Journal, Issue 40, Summer 2001.
[26] M. Riaz, H. Heys, The FPGA Implementation of RC6 and CAST-256 Encryption Algorithms, CCECE’99, Edmonton, Alberta, Canada, 1999.
[27] M. Rawski, L. Jozwiak, M. Nowicka, T. Luba, Non-Disjoint Decomposition of Boolean Functions and Its Application in FPGA-oriented Technology Mapping, Proc. EUROMICRO’97, Budapest, Hungary, September 1-4, 1997.
[28] R. Rivest, M. Robshaw, R. Sidney, The RC6 Block Cipher, NIST AES Proposal, June 1998.
[29] B. Schneier, J. Kelsey, D. Whiting, D. Wagner, C. Hall, N. Ferguson, Twofish: A 128-bit Block Cipher, NIST AES Proposal, June 1998.
[30] B. Schneier, J. Kelsey, D. Whiting, D. Wagner, C. Hall, N. Ferguson, Performance Comparison of the AES Submissions, Second AES Candidate Conference, Rome, April 1999.
[31] A. Satoh, N. Ooba, K. Takano, E. D’Avignon, High-Speed MARS Hardware, Proc. 3rd Advanced Encryption Standard (AES) Candidate Conference, New York, April 13-14, 2000.
[32] S. Trimberger, R. Pang, A. Singh, A 12 Gbps DES Encryptor/Decryptor core in an FPGA, Proc. Cryptographic Hardware and Embedded Systems Workshop, CHES 2000.
[33] B. Weeks, M. Bean, T. Rozylowicz, C. Ficke, Hardware Performance Simulations of Round 2 Advanced Encryption Standard Algorithms, Proc. 3rd Advanced Encryption Standard (AES) Candidate Conference, New York, April 13-14, 2000.
[34] Xilinx, Inc., Virtex 2.5V Field Programmable Gate Arrays, The Programmable Logic Data Book, 2000.
[35] K. Gaj, P. Chodowiec, Fast implementation and fair comparison of the final candidates for Advanced Encryption Standard using Field Programmable Gate Arrays, Proc. RSA Security Conference - Cryptographer's Track, San Francisco, CA, April 8-12, 2001.
[36] T. Ichikawa, T. Kasuya, M. Matsui, Hardware Evaluation of the AES Finalists, Proc. 3rd Advanced Encryption Standard (AES) Candidate Conference, New York, April 13-14, 2000.