The Advanced Encryption Standard on a Reconfigurable Computer

By
Pradeep Kancharla
Bachelor of Engineering
Osmania University, 2001

Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in the
Department of Computer Science and Engineering
University of South Carolina
2003

____________________________            ____________________________
Department of Computer Science          Department of Computer Science
and Engineering                         and Engineering
Director of Thesis                      First Reader
Department of Computer Science          Dean of the Graduate School
and Engineering
Second Reader
ACKNOWLEDGEMENTS
I would like to express my deepest gratitude to my advisor, Dr. Duncan A. Buell, for his untiring guidance and encouragement, which made this thesis possible. I would like to thank my research group, the Reconfig, for their support during the preparation of the thesis. Last but not least, I wish to express my deepest appreciation and gratitude to my parents and sister in India for all their love and unfailing support throughout these years.
Table of Contents
1. The Advanced Encryption Standard
2. The HC 36m – A Reconfigurable Computer
Each round except the final round consists of four different transformations. They are
ByteSub, ShiftRow, MixColumn and the Round Key Addition. The final
round does not contain MixColumn. The Round Key, which is used in the Round
Key Addition, is derived from the cipher key through a process called Key
Schedule. This can be done initially before the rounds or in parallel with the rounds.
The algorithm for Encryption and Decryption is given in pseudo-C code below. The
number of rounds in the code depends on the lengths of the key and of the block.
Key Schedule (Initial Cipher Key, Expanded Round Key);
Round Key Addition (State, Round Key);
For (I = 0; I < Number of Rounds; I++)
{
    ByteSub (State);
    ShiftRow (State);
    if (!Final Round) MixColumn (State);
    Round Key Addition (State, Round Key);
}
Fig 2: Pseudo-C code for Encryption
Key Schedule (Initial Cipher Key, Expanded Round Key);
For (I = 0; I < Number of Rounds; I++)
{
    Round Key Addition (State, Round Key);
    if (I != 0) InvMixColumn (State);
    InvByteSub (State);
    InvShiftRow (State);
}
Round Key Addition (State, Round Key);
Fig 3: Pseudo-C code for Decryption
The Key Schedule can be done either before the rounds or in parallel with the rounds.
In the Key Schedule the initial key is expanded to a length equal to the block length
multiplied by one more than the number of rounds. This produces a different Round Key
for each round, which is used in Round Key Addition. As Decryption is just
the inverse of Encryption, our emphasis will be on Encryption, with a further explanation
of the differences for Decryption whenever required.
ByteSub Transformation:
This transformation works independently on each of the cells of the State. The
transformation consists of two parts. First, the multiplicative inverse of the byte is
calculated, followed by an affine transformation. The affine transformation to be applied
is given below:
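\[
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
+
\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}
\]
Here x_0 ... x_7 are the bits of the multiplicative inverse (x_0 least significant) and y_0 ... y_7 are the bits of the result.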
Fig 4: Affine transformation in ByteSub [4]
All the operations are done in GF(2^8). For the multiplicative inverse, ‘00’ is mapped
onto itself. In the case of Decryption, called InvByteSub, the inverse of the affine
mapping above is applied, followed by taking the multiplicative inverse.
Since the bitwise operations in GF(2^8) are hard to implement in software, a different
approach is used in the actual implementation.
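The two parts of ByteSub can be sketched directly in C before turning to lookup tables. This is only an illustration, not the code used in our implementation: the function names gf_mul, gf_inv and byte_sub are hypothetical, and the reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) is assumed to be the standard Rijndael modulus.

```c
#include <stdint.h>

/* Multiply two elements of GF(2^8), reducing modulo
 * x^8 + x^4 + x^3 + x + 1 (0x11B, assumed Rijndael modulus). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
        b >>= 1;
    }
    return p;
}

/* Multiplicative inverse by exhaustive search; '00' maps onto itself. */
static uint8_t gf_inv(uint8_t a)
{
    unsigned b;
    if (a == 0)
        return 0;
    for (b = 1; b < 256; b++)
        if (gf_mul(a, (uint8_t)b) == 1)
            return (uint8_t)b;
    return 0; /* not reached: every non-zero element has an inverse */
}

/* Rotate an 8-bit value left by n bits (1 <= n <= 7). */
static uint8_t rotl8(uint8_t x, unsigned n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* ByteSub for one cell: inverse first, then the affine transformation,
 * written here as XORs of rotated copies plus the constant 0x63. */
uint8_t byte_sub(uint8_t a)
{
    uint8_t x = gf_inv(a);
    return (uint8_t)(x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3)
                       ^ rotl8(x, 4) ^ 0x63);
}
```

Calling byte_sub for all 256 inputs once would fill the kind of 256-entry table that a lookup-based implementation indexes directly.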
ShiftRow Transformation:
This transformation is applied independently to all the four rows. Each row is cyclically
shifted left by a different offset. The first row is not shifted at all. The offsets of each
row are determined by the block length. The following table gives the offsets in terms of
columns to be moved for varying block sizes.
Shift offsets    Row 2    Row 3    Row 4
BL = 128           1        2        3
BL = 192           1        2        3
BL = 256           1        3        4
Table 2: Offsets of rows based on block lengths
In the case of Decryption, called InvShiftRow, the rows are shifted back to nullify the
effect. That is, each row is cyclically shifted left by an offset equal to the number of columns
of the State minus the offset used for Encryption.
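A minimal C sketch of ShiftRow and InvShiftRow for a 128-bit block follows. The names NB, shift_row and inv_shift_row are hypothetical, and the State is assumed to be stored as state[row][column]; only the offsets come from Table 2.

```c
#include <stdint.h>

#define NB 4  /* number of State columns for BL = 128 (assumed) */

/* Encryption offsets for rows 1 to 4 when BL = 128 (Table 2). */
static const unsigned shift_offset[4] = {0, 1, 2, 3};

/* Cyclically shift one row of the State left by `off` columns. */
static void rotate_row_left(uint8_t row[NB], unsigned off)
{
    uint8_t tmp[NB];
    unsigned c;
    for (c = 0; c < NB; c++)
        tmp[c] = row[(c + off) % NB];
    for (c = 0; c < NB; c++)
        row[c] = tmp[c];
}

/* ShiftRow: each row uses its own offset; the first row is untouched. */
void shift_row(uint8_t state[4][NB])
{
    unsigned r;
    for (r = 0; r < 4; r++)
        rotate_row_left(state[r], shift_offset[r]);
}

/* InvShiftRow: shift left by NB minus the Encryption offset. */
void inv_shift_row(uint8_t state[4][NB])
{
    unsigned r;
    for (r = 0; r < 4; r++)
        rotate_row_left(state[r], (NB - shift_offset[r]) % NB);
}
```

Applying shift_row and then inv_shift_row returns the State to its original arrangement, since the two offsets for each row add up to a multiple of NB.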
MixColumn Transformation:
This transformation is applied independently on each column of the State. Each column
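of the State is treated as a polynomial. For example, the first column in Fig 1 can be
treated as a1x^3 + a2x^2 + a3x + a4. This polynomial is multiplied by a fixed polynomial
given by e(x) = 03x^3 + 01x^2 + 01x + 02, modulo x^4 + 1, in GF(2^8).
This can be done as a matrix multiplication as follows, where b1x^3 + b2x^2 + b3x + b4 is
the resulting column:
\[
\begin{bmatrix} b_4 \\ b_3 \\ b_2 \\ b_1 \end{bmatrix}
=
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} a_4 \\ a_3 \\ a_2 \\ a_1 \end{bmatrix}
\]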
Fig 5: Polynomial multiplication using matrices [4]
In the case of Decryption, called InvMixColumn, each column is multiplied by the
polynomial d(x) = 0Bx^3 + 0Dx^2 + 09x + 0E, so that e(x) d(x) = 1 modulo x^4 + 1.
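The column multiplication can be sketched in C as below. This is an illustration under stated assumptions rather than our implementation: it uses the standard Rijndael coefficients (03, 01, 01, 02 for Encryption and 0B, 0D, 09, 0E for Decryption), takes col[0] to be the constant-term byte of the column, and the names gf_mul, mix_column and inv_mix_column are hypothetical.

```c
#include <stdint.h>

/* Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (assumed modulus). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
        b >>= 1;
    }
    return p;
}

/* MixColumn on one column: multiply by 03x^3 + 01x^2 + 01x + 02
 * modulo x^4 + 1, written out as the equivalent matrix product. */
void mix_column(uint8_t col[4])
{
    uint8_t a0 = col[0], a1 = col[1], a2 = col[2], a3 = col[3];
    col[0] = (uint8_t)(gf_mul(a0, 0x02) ^ gf_mul(a1, 0x03) ^ a2 ^ a3);
    col[1] = (uint8_t)(a0 ^ gf_mul(a1, 0x02) ^ gf_mul(a2, 0x03) ^ a3);
    col[2] = (uint8_t)(a0 ^ a1 ^ gf_mul(a2, 0x02) ^ gf_mul(a3, 0x03));
    col[3] = (uint8_t)(gf_mul(a0, 0x03) ^ a1 ^ a2 ^ gf_mul(a3, 0x02));
}

/* InvMixColumn: multiply by 0Bx^3 + 0Dx^2 + 09x + 0E modulo x^4 + 1. */
void inv_mix_column(uint8_t col[4])
{
    uint8_t a0 = col[0], a1 = col[1], a2 = col[2], a3 = col[3];
    col[0] = (uint8_t)(gf_mul(a0, 0x0E) ^ gf_mul(a1, 0x0B)
                     ^ gf_mul(a2, 0x0D) ^ gf_mul(a3, 0x09));
    col[1] = (uint8_t)(gf_mul(a0, 0x09) ^ gf_mul(a1, 0x0E)
                     ^ gf_mul(a2, 0x0B) ^ gf_mul(a3, 0x0D));
    col[2] = (uint8_t)(gf_mul(a0, 0x0D) ^ gf_mul(a1, 0x09)
                     ^ gf_mul(a2, 0x0E) ^ gf_mul(a3, 0x0B));
    col[3] = (uint8_t)(gf_mul(a0, 0x0B) ^ gf_mul(a1, 0x0D)
                     ^ gf_mul(a2, 0x09) ^ gf_mul(a3, 0x0E));
}
```

Note that the Encryption matrix needs only two real multiplications per output byte, because two of its coefficients are 01, whereas every coefficient of the Decryption matrix requires one.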
Round Key Addition:
In this transformation, the Round Key is added to the State. Addition in GF(2^8) is a
simple bitwise XOR. The Round Key is of the same length as the State. It is derived from
the initial cipher key by means of the Key Schedule.
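As a small illustration, Round Key Addition reduces to one XOR per byte. The function name round_key_addition and the flat byte-array layout are assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Add the Round Key to the State: addition in GF(2^8) is a bytewise XOR.
 * `len` is the number of bytes in the State (16 for a 128-bit block). */
void round_key_addition(uint8_t *state, const uint8_t *round_key, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        state[i] ^= round_key[i];
}
```

Because XOR is its own inverse, applying the same Round Key a second time restores the State, which is why the same operation serves both Encryption and Decryption.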
Key Schedule:
The Key Schedule is the process of deriving the Round Key for each round from the
initial cipher key. This involves expansion of the initial key followed by selection of the
key for each round. The Round Key Addition is done once every round and an
additional Round Key Addition is done, before the rounds in the case of
Encryption, and after in Decryption. Since Round Key should be the same length as
Block, the total number of Round Key bits, called the Expanded Round Key, must
be the block length times one greater than the number of rounds. A pseudo-C
implementation of Key Schedule is explained below. The expanded key can be viewed as
an array of 32-bit words represented as W[nb*(nr+1)], where nb is the number of
columns in the State and nr is the number of rounds.
Key expansion is done differently for different key sizes. Let nk be the number of 32-bit
words in the key. The function subbyte takes a 32-bit word, performs a byte
substitution on each of its bytes, and returns a 32-bit word. The function rotbyte performs a left
cyclic permutation by bytes on its input. The Col function returns a 32-bit word
packed from the bytes given as input. We can see that the Expanded Key also contains
the initial cipher key in its original form.
The function rcon(i) is Col(Rc[i],‘00’, ‘00’, ‘00’). Rc[i], also called
the round constant, is given by the following formula
Rc[1] = ‘01’
Rc[i] = ‘02’ · Rc[i-1], that is, ‘02’ raised to the power i-1 in GF(2^8)
For (i = 0; i < nk; i++)
    W[i] = Col(key[4*i], key[4*i+1], key[4*i+2], key[4*i+3]);
For (i = nk; i < nb * (nr + 1); i++)
{
    temp = W[i - 1];
    if (nk <= 6)
    {
        if (i % nk == 0)
            temp = subbyte(rotbyte(temp)) ^ rcon(i/nk);
    }
    else
    {
        if (i % nk == 0)
            temp = subbyte(rotbyte(temp)) ^ rcon(i/nk);
        else if (i % nk == 4)
            temp = subbyte(temp);
    }
    W[i] = W[i - nk] ^ temp;
}
Fig 6: Pseudo-C implementation of Key Schedule [4]
For Encryption the necessary round bits are taken from W starting from index i = 0. For
Decryption it is the reverse. The round key taken for the last round will be used in the
first round in Decryption in the same order of bits.
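The Key Schedule of Fig 6 can be turned into compilable C along the following lines. This is a sketch under assumptions, not the reference code: Col is assumed to pack its first argument into the most significant byte, the substitution is computed from its definition instead of a table, the modulus 0x11B is assumed, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (assumed modulus). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
        b >>= 1;
    }
    return p;
}

/* ByteSub of one byte: inverse by search, then the affine map. */
static uint8_t sub_one(uint8_t a)
{
    uint8_t x = 0, y;
    unsigned b, n;
    for (b = 1; a != 0 && b < 256; b++)
        if (gf_mul(a, (uint8_t)b) == 1) { x = (uint8_t)b; break; }
    y = x;
    for (n = 1; n <= 4; n++)
        y ^= (uint8_t)((x << n) | (x >> (8 - n)));
    return (uint8_t)(y ^ 0x63);
}

/* subbyte: substitute each of the four bytes of a word. */
static uint32_t subbyte(uint32_t w)
{
    return ((uint32_t)sub_one((uint8_t)(w >> 24)) << 24)
         | ((uint32_t)sub_one((uint8_t)(w >> 16)) << 16)
         | ((uint32_t)sub_one((uint8_t)(w >> 8)) << 8)
         |  (uint32_t)sub_one((uint8_t)w);
}

/* rotbyte: left cyclic permutation by one byte. */
static uint32_t rotbyte(uint32_t w)
{
    return (w << 8) | (w >> 24);
}

/* rcon(i) = Col(Rc[i], 00, 00, 00) with Rc[1] = 01, Rc[i] = 02 * Rc[i-1]. */
static uint32_t rcon(unsigned i)
{
    uint8_t rc = 1;
    unsigned k;
    for (k = 1; k < i; k++)
        rc = gf_mul(rc, 0x02);
    return (uint32_t)rc << 24;
}

/* Expand `key` (4*nk bytes) into w[0 .. nb*(nr+1) - 1]. */
void key_schedule(const uint8_t *key, unsigned nk, unsigned nb, unsigned nr,
                  uint32_t *w)
{
    unsigned i;
    for (i = 0; i < nk; i++)
        w[i] = ((uint32_t)key[4 * i] << 24) | ((uint32_t)key[4 * i + 1] << 16)
             | ((uint32_t)key[4 * i + 2] << 8) | (uint32_t)key[4 * i + 3];
    for (i = nk; i < nb * (nr + 1); i++) {
        uint32_t temp = w[i - 1];
        if (i % nk == 0)
            temp = subbyte(rotbyte(temp)) ^ rcon(i / nk);
        else if (nk > 6 && i % nk == 4)
            temp = subbyte(temp);
        w[i] = w[i - nk] ^ temp;
    }
}
```

For a 128-bit key and block one would call key_schedule(key, 4, 4, 10, w) with w holding 44 words; Encryption then reads the Round Keys from w[0] upward and Decryption reads them in the opposite order.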
The Galois Field GF(2^m):
A Galois Field GF(q) is a field with q elements, also called a finite field because there is
a finite number (q) of elements. A primitive element of GF(q) is an element ‘a’ such that
every field element except zero can be expressed as a power of a. Each Galois Field has
at least one primitive element. If q = 2^m, where m is a positive integer, the
elements of the field can be represented by polynomials of degree less than m whose
coefficients are elements of the field GF(2), that is, 0 and 1. The primitive element of
such a field would itself be such a polynomial.
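The notion of a primitive element can be checked numerically for GF(2^8). The sketch below assumes the Rijndael modulus x^8 + x^4 + x^3 + x + 1 and uses the hypothetical names gf_mul, gf_order and gf_is_primitive; it computes the multiplicative order of an element and tests whether its powers reach all 255 non-zero elements.

```c
#include <stdint.h>

/* Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (assumed modulus). */
uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
        b >>= 1;
    }
    return p;
}

/* Smallest n >= 1 with a^n = 1; returns 0 for the zero element,
 * which has no multiplicative order. */
unsigned gf_order(uint8_t a)
{
    unsigned n = 1;
    uint8_t p = a;
    if (a == 0)
        return 0;
    while (p != 1) {
        p = gf_mul(p, a);
        n++;
    }
    return n;
}

/* An element is primitive when its powers cover all 255 non-zero values. */
int gf_is_primitive(uint8_t a)
{
    return gf_order(a) == 255;
}
```

With this modulus the element ‘03’ (the polynomial x + 1) is primitive, whereas ‘02’ (the polynomial x) is not, which is why ‘03’ is the natural base for logarithm and antilogarithm tables.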
Arithmetic in GF(2^8):
As we see above, all the arithmetic is done at the byte level. In GF(2^8), the addition of
bits 1 and 1 is 0. This arithmetic cannot be implemented in software using standard
functions such as multiplication and division for finding the product and other values like
the multiplicative inverse. Manipulation of bits in software is complex and hard to debug.
Fortunately, however, since the Galois Field is represented as 8-bit values, the possible
input values for any unary operation will be one of 256 values. This can be utilized in
dividing the complex operation into a series of unary operations and performing the
unary operations using lookup tables instead of performing the actual arithmetic itself.
For example, we can do the multiplication by using the logarithm and antilogarithm
functions. Taking the logarithm of the multiplicand and of the multiplier can each be done in
one step using a linear array of 256 values. These logarithms are then added modulo 255
(ordinary integer addition, unlike addition in the field itself, which is an XOR). The antilog
of the sum can then be obtained by using another lookup table.
Arithmetic can thus be done much more easily at the cost of extra memory.
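The logarithm and antilogarithm approach can be sketched as follows. The tables are generated here at start-up from the primitive element ‘03’ rather than stored as constants; the modulus 0x11B and the names gf_build_tables and gf_mul_tables are assumptions made for the illustration.

```c
#include <stdint.h>

static uint8_t log_table[256];   /* log_table[0] is never used */
static uint8_t alog_table[255];  /* alog_table[i] = '03' to the power i */

/* Fill both tables by stepping through the powers of '03'. */
void gf_build_tables(void)
{
    unsigned i;
    uint8_t x = 1;
    for (i = 0; i < 255; i++) {
        alog_table[i] = x;
        log_table[x] = (uint8_t)i;
        /* x = x * '03' = (x * '02') XOR x, reduced modulo 0x11B */
        x = (uint8_t)(x ^ (x << 1) ^ ((x & 0x80) ? 0x1B : 0x00));
    }
}

/* Multiply with three lookups and one addition modulo 255.
 * Zero has no logarithm, so it is handled separately. */
uint8_t gf_mul_tables(uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0)
        return 0;
    return alog_table[(log_table[a] + log_table[b]) % 255];
}
```

gf_build_tables must be called once before the first multiplication. The two tables together occupy 511 bytes, which is the extra memory traded for the simpler arithmetic.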
C code:
We are using a C implementation of the algorithm to test the results of the VHDL and
Viva implementations. The code is taken from Daemen and Rijmen [3]. This
implementation is done using look-up tables. The lookup tables are stored as linear
arrays. The lookup tables are used in the ByteSub, Key Schedule and
MixColumn transformations. The independent operations on different elements in each
stage are done iteratively. The State is stored in a two dimensional array.
In the case of ByteSub, a for loop is used to iterate over all rows and columns of the
State. The transformation is done using a linear array of 256 elements from which the
input is used as an index to the array containing the transformed values. Thus the whole
transformation, of finding a multiplicative inverse and applying an affine transformation,
is done using a single lookup.
The ShiftRow transformation is done based on the block length. The shifts of each row
are stored in an array. Based on whether the algorithm is in an Encryption or Decryption
stage, appropriate shifts are fetched from the array and the rows are shifted accordingly.
The MixColumn transformation is implemented by running a for loop on the number
of columns. The multiplications are done using the log and antilog lookup tables. The log
and antilog values of all of the possible 256 inputs are stored in two linear arrays. The
appropriate value is retrieved using the index. Thus the multiplication can be done by
using three lookups (two log values and one antilog value) and an addition. This
avoids the complex bitwise manipulations involved in actual Galois Field
arithmetic. The polynomial to be multiplied is stored as constants in the program.
The Round Key Addition is done using an XOR. The Key is passed as a two-
dimensional array and a for loop is used to iterate on all the cells of the State.
The Key Schedule uses two lookup tables for doing the rotation and substitution. A
three-dimensional array is used to store the Expanded Key. Key selection is done on the
primary index. The Key schedule is done before the rounds, and all the key bits are
stored in an array which is used in Encryption as well as Decryption.
The C code is used only to check the results of our other implementations in VHDL and
Viva®. We implemented the algorithm in different ways to evaluate their resource usage
and timing. The code is included as Appendix A.
Chapter 2: HC36m – A Reconfigurable Computer
The platform we are targeting is an HC 36m Hypercomputer® developed by Star Bridge
Systems. The reconfigurable resources on the Hypercomputer comprise five Xilinx
Virtex-II 6000 and two Virtex-II 4000 FPGA chips organized in a proprietary manner.
The processing capability of this architecture is built upon four Processing Elements
(PEs). Each PE is a Xilinx Virtex-II 6000 FPGA chip connected to four DDR RAM
modules each of 512MB with a 90-bit wide communication link. The four PEs are
arranged in a “Quad Structure” passing through a cross-point, which is another Virtex-II
6000 chip with a 50-bit wide communication link to each PE. The Virtex-II 4000 chips
serve as a bus controller and a router. The 2.4GHz Xeon processors on the host are
connected to the FPGA interface through a 64-bit bidirectional PCI-X bus running at
66MHz. If the data to be sent is wider than the bus, it is multiplexed over the bus in
several transfers.
Fig 7: Quad Structure [20]
Fig 8: Architecture of HC 36m [20]
The HC 36m comes with a development environment called Viva®. Viva provides a graphical
editor for designing applications, which are then synthesized by Viva and mapped onto
hardware using Xilinx tools. The design need not be constrained to a single chip, since
Viva is capable of mapping designs onto more than one chip. Viva also comes with a rich
library of objects which can be used in the design of applications. A snapshot of the
library objects is shown in Fig 9. The current version of the library comes in a sheet
called corelib.
Fig 9: Corelib
The I2ADL editor provides a graphical interface for creating applications. A design can
be stored as a sheet. The sheets can be made into objects to be reusable in other designs.
Thus a user can create his/her own library of objects and reuse them just by loading the
sheet with its objects and dragging the objects onto the new sheet.
Fig 10: Snapshot of Viva.
There are three more editors in Viva: the Data Set Editor, used to create new data sets;
the Resource Editor used for allocating resources; and the System Editor used to
manipulate constraints such as the EDIF file to be compiled, the system descriptions, the
clock period, and so forth. The object-oriented paradigm allows one to build designs
hierarchically, thus decreasing the complexity.
The most important concept for any programming language, however, is debugging, and
debugging in Viva can be very difficult. The error messages given by Viva, for example,
have not been very useful. This makes programming difficult if something goes wrong.
The “widget interface” is not really sufficient for hardware designs. It would be more
useful if there were a way to see the timing diagram for a design on the hardware.
The current Viva version is Viva 2.3. This version has some enhancements over previous
versions in terms of synthesis time, but many of the existing designs that synthesized and
executed under previous versions are not compatible with this new version. We have had
to make some changes in our designs in order to migrate to the new version.
Chapter 3: VHDL Implementation
A VHDL implementation of the algorithm for a 128-bit key and block size has been done
so that its silicon usage and delay can be compared with those of the Viva implementation.
In this implementation, lookup tables were used rather than doing the actual GF(2^8)
arithmetic. The code is added as Appendix B.
The lookup tables are stored as RAMs. There are a total of four lookup tables used in the
algorithm. Those lookup tables are stored in the files sbox_ram.vhd,
alogtable_ram.vhd, logtable_ram.vhd and rbox_ram.vhd. All lookup
tables take an index as input and output the value corresponding to that index. All the
lookup tables mentioned above except that in the rbox_ram.vhd file store 256 values
needed for a unary operation in GF(2^8). The rbox_ram.vhd file contains a lookup
table having 30 values required for the rotation operation in Key Schedule. They are
indexed starting from zero.
Since the algorithm works in GF(2^8), all the variables are defined to be large_int, a
subtype of integer that allows values in the range 0 to 255 only. Other packages and type
definitions are stored in the file packages.vhd.
Key Schedule is done before the rounds start and the Expanded Key is stored in arrays.
All the operations in the transformations of the round are implemented in parallel, in
contrast to the iterated approach used in the C implementation. This uses a great deal of
silicon resources but gives minimum delay. The code was simulated using ModelSim
[15] and synthesized using the Xilinx ISE [26] compiler. The various entities of the
algorithm are explained in order of complexity and hierarchy.
shiftrow:
Since we are dealing with only one block size, this transformation can be implemented
simply by routing the inputs to the appropriate outputs, and no silicon will be used for
this transformation. The code for this is in the shiftrow.vhd file. The inputs for this
entity are sixteen values of type large_int and the outputs are merely in a shuffled
order.
roundkey:
This entity is used to do an 8-bit XOR. Since in VHDL we have only a bit wise XOR
function, the functions conv_std_logic_vector and conv_integer, available
in the ieee.std_logic_arith package, are used for conversion between an
integer and a std_logic_vector. This entity has two inputs of type large_int
and outputs a single value of the same type. The implementation is in the file
roundkey.vhd.
round_roundkey:
The round_roundkey entity takes the State and the key in the form of 32 inputs of type
large_int and performs an 8-bit XOR of each State input with the corresponding key input.
For this, sixteen roundkey entities are used. All the XORs are implemented in parallel. The
output is 16 large_int values, which comprise the State. This entity performs the
Round Key Addition transformation in the round. The implementation is in the
round_roundkey.vhd file.
round_sbox:
This entity does the ByteSub transformation using the lookup tables (RAM_sbox). The
round_sbox entity takes State in the form of sixteen large_int inputs and passes
them through sixteen RAM_sbox entities in parallel. The output is again the State. The
code corresponding to this entity is in the round_sbox.vhd file.
addcmp:
This takes two inputs of type large_int and adds them modulo 255. The output is also
a large_int. The implementation is in the addcmp.vhd file. This entity is used primarily in
multiplication, as explained below.
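In C the behaviour of addcmp can be sketched as an addition followed by a comparison. This is an illustration, not a translation of addcmp.vhd; it assumes both inputs are logarithms in the range 0 to 254, so a single conditional subtraction is enough.

```c
/* Add two logarithms modulo 255.
 * Assumes 0 <= a, b <= 254, so the sum is at most 508 and one
 * comparison with 255 suffices to bring it back into range. */
unsigned addcmp(unsigned a, unsigned b)
{
    unsigned sum = a + b;
    return (sum >= 255) ? sum - 255 : sum;
}
```

The modulus is 255 rather than 256 because the non-zero elements of GF(2^8) form a multiplicative group of 255 elements.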
multiply:
This entity takes two values to be multiplied as input and produces the product. All the
inputs and outputs are of type large_int. The entities used for this are the two lookup
tables RAM_logtable and RAM_alogtable. One of the inputs is given as an input to
the RAM_logtable entity. The output would be the log value of the input. The other
input is itself the log value, since it is always constant. The log value is given as an input
to avoid another lookup. These two values are given as inputs to the addcmp entity. The
output of addcmp is passed to RAM_alogtable, which provides the product. Finally,
the inputs are checked for zero. If any of the inputs is zero, then the output is returned
as zero; otherwise the product is passed as the output. The code is in the Multiply.vhd
file.
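The data flow of the multiply entity can be mirrored in C as a sketch. The tables are built at start-up from the assumed primitive element ‘03’ and modulus 0x11B; the names build_log_tables and multiply are hypothetical, and the second argument is already a logarithm, as in the entity described above.

```c
#include <stdint.h>

static uint8_t logtable[256];   /* stands in for RAM_logtable  */
static uint8_t alogtable[255];  /* stands in for RAM_alogtable */

/* Generate both tables from the powers of '03' (assumed generator). */
void build_log_tables(void)
{
    unsigned i;
    uint8_t x = 1;
    for (i = 0; i < 255; i++) {
        alogtable[i] = x;
        logtable[x] = (uint8_t)i;
        x = (uint8_t)(x ^ (x << 1) ^ ((x & 0x80) ? 0x1B : 0x00));
    }
}

/* Multiply a State byte by a constant whose logarithm is supplied
 * directly, which saves one table lookup per multiplication. */
uint8_t multiply(uint8_t state_byte, uint8_t log_of_constant)
{
    unsigned sum = (unsigned)logtable[state_byte] + log_of_constant;
    uint8_t product;
    if (sum >= 255)             /* the addcmp step: add modulo 255 */
        sum -= 255;
    product = alogtable[sum];
    /* Zero has no logarithm, so a zero input forces a zero output. */
    return (state_byte == 0) ? 0 : product;
}
```

With ‘03’ as the base, the logarithm of the coefficient ‘02’ is 25 and that of ‘03’ is 1, so multiply(b, 25) doubles b in the field and multiply(b, 1) triples it.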
mix:
This entity takes a column of the State shifted by different offsets. It multiplies these four
values with the constant polynomial used in the algorithm and adds the results. The
output is one cell of the State after the MixColumn transformation. For this entity the
inputs are the four large_ints and the output is a large_int. The other entities
used here are the multiply and roundkey. Since the polynomial used in the
multiplication has two coefficients of 1, multiplication with them is redundant. Thus,
only two multiplications are used to get the other two products. Later these four values
are added using the roundkey entities and the result is passed out. The code is in the
Mix.vhd file.
mixcolumn:
This entity performs the MixColumn transformation. It takes the State in the form of
sixteen large_ints and outputs the same after the transformation. For this purpose it
uses sixteen mix entities. It takes one column at a time, shifts it appropriately,
and passes it to the mix entities. The outputs of these entities are placed in the
corresponding places of the State. All the operations are done in parallel. The
implementation is in the MixColumn.vhd file.
keyschedule:
This entity takes key values for a round as an input and produces the key values for the
next round. The input is taken in the form of sixteen large_ints and the outputs are
stored in a key array. The entities used here are roundkey,
RAM_sbox and RAM_rbox. The inputs are routed through these entities such that they
produce the desired output. The implementation is in the keyschedule.vhd file.
round:
This constitutes a round of the algorithm. The inputs are the State and the key in the form
of 32 inputs, and the output is the State. Both inputs and outputs are of type
large_int. The State inputs are first routed through the round_sbox entity followed
by shiftrow, mixcolumn and roundkey. The key inputs are directly routed to
roundkey. The output would be the transformed State after applying one round
transformation. The implementation is in the round.vhd file.
lround:
This actually implements the last round of the algorithm, which is slightly different from
the remaining rounds. The only difference between round and lround is that the latter
does not have the mixcolumn entity. The output of the shiftrow is directly routed to
roundkey. The implementation is found in the lround.vhd file.
aes:
This entity connects all the pieces to complete the algorithm. The roundkey, round,
lround, keyschedule entities are used here. First, the key is passed as input to
keyschedule. There is a series of ten keyschedule entities, the output of each
of which is fed to the next. The initial key is fed as input to the first keyschedule. At
the end, the outputs of all the keyschedule entities hold the Expanded Key for the
entire algorithm. The State is first passed through sixteen roundkey entities. This is the
initial Round Key Addition transformation performed prior to the rounds. Then we
have nine round entities and one lround entity, the output of each being passed to the next. The
output of lround is the required encrypted block. The code can be found in the file
aes.vhd.
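The way the aes entity chains its parts can be sketched in C for a 128-bit key and block. This is an illustrative software analogue, not a translation of aes.vhd: the substitution table is computed from its definition, the modulus 0x11B and the byte ordering (byte r + 4c holds row r, column c) are assumptions, and the function aes_encrypt is hypothetical. The helper names echo the entity names only for readability.

```c
#include <stdint.h>
#include <string.h>

static uint8_t sbox[256];

/* Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (assumed modulus). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (; b; b >>= 1) {
        if (b & 1) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
    }
    return p;
}

/* Fill the substitution table: inverse by search, then the affine map. */
static void build_sbox(void)
{
    unsigned a, b, n;
    for (a = 0; a < 256; a++) {
        uint8_t x = 0, y;
        for (b = 1; a != 0 && b < 256; b++)
            if (gf_mul((uint8_t)a, (uint8_t)b) == 1) { x = (uint8_t)b; break; }
        y = x;
        for (n = 1; n <= 4; n++)
            y ^= (uint8_t)((x << n) | (x >> (8 - n)));
        sbox[a] = (uint8_t)(y ^ 0x63);
    }
}

/* keyschedule: derive the next Round Key from the previous one. */
static void keyschedule(const uint8_t in[16], uint8_t rc, uint8_t out[16])
{
    int i;
    out[0] = (uint8_t)(in[0] ^ sbox[in[13]] ^ rc);
    out[1] = (uint8_t)(in[1] ^ sbox[in[14]]);
    out[2] = (uint8_t)(in[2] ^ sbox[in[15]]);
    out[3] = (uint8_t)(in[3] ^ sbox[in[12]]);
    for (i = 4; i < 16; i++)
        out[i] = (uint8_t)(in[i] ^ out[i - 4]);
}

static void roundkey(uint8_t s[16], const uint8_t k[16])
{
    int i;
    for (i = 0; i < 16; i++) s[i] ^= k[i];
}

/* round_sbox followed by shiftrow, done in one pass. */
static void sbox_shiftrow(uint8_t s[16])
{
    uint8_t t[16];
    int r, c;
    for (c = 0; c < 4; c++)
        for (r = 0; r < 4; r++)
            t[r + 4 * c] = sbox[s[r + 4 * ((c + r) % 4)]];
    memcpy(s, t, 16);
}

static void mixcolumn(uint8_t s[16])
{
    int c;
    for (c = 0; c < 4; c++) {
        uint8_t *a = s + 4 * c;
        uint8_t a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3];
        a[0] = (uint8_t)(gf_mul(a0, 2) ^ gf_mul(a1, 3) ^ a2 ^ a3);
        a[1] = (uint8_t)(a0 ^ gf_mul(a1, 2) ^ gf_mul(a2, 3) ^ a3);
        a[2] = (uint8_t)(a0 ^ a1 ^ gf_mul(a2, 2) ^ gf_mul(a3, 3));
        a[3] = (uint8_t)(gf_mul(a0, 3) ^ a1 ^ a2 ^ gf_mul(a3, 2));
    }
}

/* Key expansion, initial key addition, nine rounds, one last round. */
void aes_encrypt(const uint8_t key[16], const uint8_t in[16], uint8_t out[16])
{
    uint8_t rk[11][16];
    uint8_t rc = 1;
    int r;
    build_sbox();
    memcpy(rk[0], key, 16);
    for (r = 1; r <= 10; r++) {          /* ten chained keyschedule steps */
        keyschedule(rk[r - 1], rc, rk[r]);
        rc = gf_mul(rc, 2);
    }
    memcpy(out, in, 16);
    roundkey(out, rk[0]);                /* initial Round Key Addition */
    for (r = 1; r <= 9; r++) {
        sbox_shiftrow(out);
        mixcolumn(out);
        roundkey(out, rk[r]);
    }
    sbox_shiftrow(out);                  /* last round omits mixcolumn */
    roundkey(out, rk[10]);
}
```

Where the hardware instantiates every stage side by side, this sketch executes the same stages one after another, so it shows the data flow but not the parallelism.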
Decryption:
Decryption is similar to Encryption, with minor differences explained below in terms of
entities for each transformation.
The first difference is the InvByteSub transformation. Instead of the RAM_sbox used
in Encryption, we use in Decryption an entity RAM_dsbox that contains the inverse of
the RAM_sbox values. The entity can be found in the file dsbox_ram.vhd.
The InvShiftRow is the transformation that is applied to nullify the ShiftRow
transformation applied in Encryption. For this we use the entity dshiftrow which is in
the file dshiftrow.vhd. This is similar to shiftrow in the sense that it just routes
the inputs to the appropriate outputs to produce the effect of shifting. The shifting is done
in such a way that it nullifies the shifting done in shiftrow.
The InvMixColumn differs from the MixColumn in two ways. First, a different
polynomial is multiplied with the State. Although the polynomial differs, it is stored
as constants in the same way as in MixColumn. The entity is
invmixcolumn and is implemented in the file invmixcolumn.vhd. The second
difference is in the mix entity. In Encryption we use a polynomial that has
two coefficients equal to one. Here, however, the polynomial has no coefficients equal to
one, so we cannot avoid the multiply entities as we did in Encryption. The variation is shown in
the dmix entity in the file dmix.vhd.
The Round Key Addition transformation is the same in Encryption and
Decryption. We therefore use the same entities in Decryption as in Encryption.
The order of the transformations in the round also changes in the Decryption. First the
input is routed to round_roundkey entity, which is followed by invmixcolumn,
dshiftrow and then by round_dsbox. The entity used is dround, and the
implementation can be seen in the file dround.vhd.
In Decryption, it is the first round, and not the last round, that differs from the other
rounds. The first round does not have invmixcolumn. The input is passed through
round_roundkey and then through dshiftrow and round_dsbox. The entity
representing this is the fround and the implementation can be seen in fround.vhd.
The key generation is similar to that of Encryption, but the keys are used in reverse order
compared to Encryption. The keys that are used for the first round are routed to the last
round in Decryption. Similarly, the key used in the second round is used in the ninth
round in Decryption, and the key used in lround in Encryption goes to fround
in Decryption. The entity used for this is the daes entity and is in the file daes.vhd.
All the designs are simulated using ModelSim and synthesized using Xilinx ISE.
The results of the implementation are given and analyzed in Chapter 5.
Chapter 4: Viva Implementation
The implementation of the algorithm began with lookup tables for multiplication.
Since the on-board memory has not been supported up to this point, we have used on-chip
memory to store the lookup tables. All the values of the lookup tables are read from
files stored on the host. These constants can be read to an input horn from files by adding
the following attributes.
Fig 11: Input from a file to an input horn
A file should exist at the location given beside the attribute Constant in the following
format.
Fig 12: Format of file input
The value corresponding to the index attribute is used as an index to fetch the required
value from the file. The values in the files are synthesized as CONSTANTS into the
executable. In the example above, the value 99 is stored as a constant at the input horn. In
order to implement a lookup table, we can read all the values to the input horns and use a
multiplexer to get the required values. This approach has many problems, however.
There is no parameterized generate function to create all these in one step, and
opening the attributes list and adding a different value for the index and hard coding the
path name is a tedious job.
The index problem can be countered by using sixteen Mux(17,1) objects for storing
the values instead of a single Mux(257,1) object. This will allow us to use the horns
with the indexes given 0 to 15 for each Mux object instead of using all the values from 0
to 255 for each input horn.
Fig 13: Lookup table
The 8-bit input is exposed and split into its most significant and least significant 4-bit
quantities. The LSBs are routed to all sixteen Mux(17,1) objects. Only one of the
Muxes has the required output; this Mux is selected by a further Mux object that takes the
MSBs as its select index.
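The two-level selection can be modelled in C as a sketch. It assumes that a Mux(17,1) object has sixteen data inputs and one select input, and the names mux16 and lookup256 are hypothetical; the model only shows that splitting the index into two 4-bit halves gives the same result as a direct 256-entry lookup.

```c
#include <stdint.h>

/* Model of one Mux(17,1): a 4-bit select picks one of sixteen inputs. */
static uint8_t mux16(const uint8_t inputs[16], unsigned select)
{
    return inputs[select & 0x0F];
}

/* 256-entry lookup built from sixteen first-level muxes and one
 * second-level mux, mirroring the arrangement described above. */
uint8_t lookup256(const uint8_t table[256], uint8_t index)
{
    uint8_t bank_out[16];
    unsigned lsb = index & 0x0F;   /* routed to every first-level mux */
    unsigned msb = index >> 4;     /* selects among their outputs     */
    unsigned bank;
    for (bank = 0; bank < 16; bank++)
        bank_out[bank] = mux16(&table[16 * bank], lsb);
    return mux16(bank_out, msb);
}
```

In hardware the sixteen first-level selections happen simultaneously, so the loop here stands for parallel logic rather than sequential work.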
The other problem faced is providing the path name to all the input horns. Initially we did
it for all the input horns as below.
Fig 14: Mux with pathnames given manually
Later we were made aware of an easier method for providing the file names to the
required input horns in the object. For this we create a Mux with input horns that have
Constant attributes initialized to *ROM_FILE.DTA. This will look like
Fig 15: Mux with pathnames pointing to a pointer
This Mux is then made into an object. To make this object point to a file, the object is
right-clicked and the attributes are changed as follows:
Fig 16: Setting the file pointer to a specific path
Initially the input values from the files were stored into registers. Although each lookup
table worked individually, there were problems with more than three lookup tables. When
we tried to synthesize more than three lookup tables, we got a C++ Exception error
followed by the corruption of the project. The frequency of this error diminished as new
versions of Viva were released, and later the mistake was corrected, resulting both in
decreased silicon usage and compilation time. Due to increasing problems with the
lookup tables we thought to import an EDIF module for some basic operations in the
algorithm. However, the EDIF generated using VHDL was not compatible with Viva. We
were later provided with a PHP script to do the conversion, but this did not seem to be
sufficient for our needs.
Iterative approach:
The initial implementation of an iterative approach of the algorithm was targeted at
minimal usage of silicon on the chip. There are two reasons for this. First, there was no
multi-chip communication available in Viva at that time. Our VHDL implementation
showed that a fully parallel version would take two chips if the Viva synthesis tool were as
efficient as the standard Xilinx tools. Second, there were some problems encountered in
using many lookup tables. The Encryption was implemented by doing the Key
Schedule on the fly. For Decryption, the Key Schedule was done at first before
the rounds and the Expanded Key stored to be used later.
Encryption:
An iterative approach was used in the ByteSub and MixColumn transformations
inside the round and on the round also. The lookup tables required for the Encryption are
the substitution box represented by the object sbox, the Logarithm table represented as
ltable, the Antilogarithm table represented as atable, and the Rotation Box
represented as rbox. All these tables except rbox have 256 values and all are
constructed as explained above. The values are read from the files in the directories
sbox, ltable and atable placed under the directory C:\Pradeep\ on Odo
respectively. The files must be placed at that location only, since the path must be hard
coded in the design in the early versions of Viva.
Since implementation is done in an object oriented paradigm, the explanation below is
given in terms of objects created. The Encryption is a loop on the object round, which
represents a single round of the algorithm. The initial values of the key and block are
passed through the roundkey object. The output of the roundkey object and the
initial key are passed to the round object, which is then looped ten times using the For
object of the Viva library. The feedback is done using the reginit objects. The
appropriate inputs to the round object for the first round and the subsequent rounds are
selected by using the N value of the For object.
round:
In essence, the round routes the data from one stage to the next. The Key Schedule
is done on the fly as a part of the round. The inputs for the round are the block and key of
the previous round. The substituted values required for the Key Schedule are
calculated in the round_sbox object only. The N value of the outer For loop is used to
eliminate the MixColumn stage in the tenth round. It is also incremented and used as a
pointer for the rotation box of the Key Schedule. ShiftRow is implemented by
simply routing the outputs of the round_sbox object to appropriate inputs of
round_mixcolumn object.
Fig 17: Design of a round in Encryption
round_sbox:
The round_sbox is a loop over sbox4 that calculates the substituted values for a
column of the State. The outputs of all iterations are registered. The appropriate set of
registers is selected using the decode object. The N value of the For object is passed
into the decode, which compares it with values from 0 to 4 and sets the corresponding
output bit high. The ‘done’ of the sbox4 object is used to give a pulse to the next input
of the For object. The inputs are muxed and passed into the sbox4 based on the N
value of the For object. The additional four values calculated are for Key Schedule.
sbox4:
This object is a loop around the sbox object and gives the substituted value for its input.
The outputs of all iterations are registered similarly as explained above.
round_mixcolumn:
This object is a loop around the mixcolumn object; it multiplies the column of the State
with a constant polynomial. The inputs are muxed and passed into the mixcolumn
object and the outputs are registered using the For object.
mixcolumn:
The mixcolumn object shifts the column by one for every iteration and passes them
into the mix objects, which multiply the given input with a polynomial. The output of each
iteration corresponds to a cell of the output column. The output is registered using the
RegEn object based on the iteration.
mix:
This object is a loop around the multiply object. The polynomial with which the
column is to be multiplied is stored in terms of constants. In order to eliminate one table
lookup, the logarithmic values of the coefficients of the polynomial are stored instead of
the coefficients themselves. The outputs are registered and XORed after all the iterations
are completed to get the desired value.
multiply:
The multiply object has two inputs: a coefficient of the polynomial (stored as its
logarithm, as described above) and a value of the State. The State value is passed
through the ltable and the output is added to the other input using the ADC object.
The addition must be done modulo 255, which requires that we adjust the ADC output
with the overflow bit to obtain the correct result in all instances. The resulting
value is passed through the atable to get the product. The inputs are checked for
zero. If either input is zero, the output of the atable is ignored and zero is passed
as the output.
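The table-based multiplication described above can be modelled in a few lines of Python. This is a sketch under stated assumptions rather than the Viva implementation: the thesis does not say which generator was used to fill ltable and atable, so the common choice ‘03’ is assumed here, and the names build_tables, add_mod_255 and multiply are hypothetical.

```python
def build_tables():
    """Build log (ltable) and antilog (atable) tables for GF(2^8).

    Assumption: the generator is 0x03 and the modulus is 0x11B.
    """
    ltable = [0] * 256
    atable = [0] * 256
    value = 1
    for exponent in range(255):
        atable[exponent] = value
        ltable[value] = exponent
        # Multiply by 0x03: (value * 2) XOR value, reduced modulo 0x11B.
        doubled = value << 1
        if doubled & 0x100:
            doubled ^= 0x11B
        value = doubled ^ value
    return ltable, atable


LTABLE, ATABLE = build_tables()


def add_mod_255(a, b):
    """Add two logarithms modulo 255 using an 8-bit add plus carry fix-up."""
    total = a + b
    total = (total & 0xFF) + (total >> 8)  # fold the overflow bit back in
    return 0 if total == 255 else total


def multiply(coeff_log, coeff_is_zero, state_value):
    """Multiply a State byte by a coefficient given as its logarithm."""
    if coeff_is_zero or state_value == 0:
        return 0  # the antilog output is ignored when either input is zero
    return ATABLE[add_mod_255(coeff_log, LTABLE[state_value])]
```

For instance, multiply(LTABLE[0x57], False, 0x83) reproduces the FIPS-197 worked example {57} x {83} = {c1}. Storing the logarithm of each coefficient as a constant is what saves one of the two ltable lookups per product.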
roundkey:
The roundkey object is a collection of XOR gates that XOR the key for this particular
round with the State. All the XORs are done in parallel.
keyschedule:
Key Schedule is done on the fly in the case of Encryption. The index for the rotation box
is calculated based on the iteration. The substituted values required are calculated in the
round_sbox object itself and the values are passed to the key schedule.
The decode object is used in almost all of the above objects. It functions as a DeMux.
A value is passed through Equal objects from the Viva libraries, which are initialized
to all the possible values of the input. The output corresponding to the input value is set
high.
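The behaviour of the decode object amounts to a one-hot decoder. A minimal Python model, in which the function name and the width parameter are assumptions for illustration only, might look like this:

```python
def decode(value, width):
    """One-hot decode: output i is high only when value equals i.

    Each list element plays the role of one Equal object that has been
    initialised to a single candidate value.
    """
    return [1 if value == candidate else 0 for candidate in range(width)]
```

In round_sbox, for example, the N value of the For object would be decoded with a width of five, so that exactly one register set is enabled in each iteration.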
Decryption:
The basic difference between Encryption and Decryption is the Key Schedule. The
Key Schedule is done before the rounds in this instance. The keys for all the rounds
are stored in a stack-like structure, from which the key for the round is retrieved in every
iteration. All the other stages of the Decryption are similar to Encryption and require little
explanation. The round_isbox has only four iterations, since the values required for
the Key Schedule need not be calculated. The imix of the round_imixcolumn
takes a different polynomial from the one used in Encryption.
The keysh object is a loop around the keyschedule object explained above. The
outputs are packed and registered for every iteration. Later they are routed in reverse
order (since we require the keys in reverse order in Decryption) into a Mux. The Mux
select input is driven by the N value of the For loop. The rounds are started after the Key
Schedule is done.
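The idea of computing all round keys first and then reading them back in reverse order can be sketched in Python as follows. This is a software model of the standard AES-128 key expansion and not a description of the keysh object itself; the function names are hypothetical, the S-box is generated arithmetically instead of being read from a file, and the byte ordering of the key follows FIPS-197, which is an assumption about the Viva design.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def rotl8(x, k):
    """Rotate an 8-bit value left by k bits."""
    return ((x << k) | (x >> (8 - k))) & 0xFF


def build_sbox():
    """Generate the AES S-box: field inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 1
        for _ in range(254):
            inv = gf_mul(inv, x)  # x^254 is the inverse; 0 maps to 0
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()


def expand_key(key):
    """Return the 11 round keys (16 bytes each) for a 128-bit key."""
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]          # rotate the word
            temp = [SBOX[b] for b in temp]      # substitute each byte
            temp[0] ^= rcon                     # add the round constant
            rcon = gf_mul(rcon, 2)
        words.append([a ^ b for a, b in zip(words[i - 4], temp)])
    return [bytes(sum(words[4 * r:4 * r + 4], [])) for r in range(11)]


def decryption_key_order(key):
    """Round keys in the order Decryption consumes them (last key first)."""
    return list(reversed(expand_key(key)))
```

With this ordering, iteration N of the Decryption loop simply indexes entry N of the reversed list, which mirrors the Mux selection described above.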
The files corresponding to isbox of round_isbox object are stored at
C:\Pradeep\isbox on Odo.
Fig 18: Design of round in Decryption
Expanding the loop on the round:
Since our main aim is to use maximum resources in terms of silicon, we started by
expanding the loop on the round in order to check the efficiency of Viva in synthesizing a
larger design. For this, some changes were made to the round object explained above.
The object described above uses a Mux to eliminate the MixColumn transformation in
the final round in the Encryption and InvMixColumn transformation in the first round
for Decryption. Since we are using different objects for every round, a round object was
created with a mixcolumn object and without any Mux for all the rounds except the
last one in Encryption and the first one in Decryption. Another object, lround, was
created for Encryption; this is a round without a mixcolumn. A similar object was
created for Decryption. The same approach as above was used for the Key Schedule: the key is
calculated on the fly in the case of Encryption, while for Decryption we used the keysh object
explained above. Both designs worked, and the results are given in the next chapter.
Non-iterative Approach:
A non-iterative approach was started by expanding the loops in the ByteSub stage and also in
the MixColumn stage. Given the fact that a single lookup table took 160 slices, which is
a little less than thrice the number needed in the VHDL implementation, the whole algorithm
using lookup tables cannot be done in four chips if we were to expand ByteSub and
MixColumn completely. We have thus settled for iteration on these stages. The
ShiftRow is done prior to the ByteSub to accommodate this. Then for the first
iteration the first two columns are passed to the plsbox8 object, which has eight sbox objects.
Then the output of plsbox8 is passed to two plmix4 objects. The plmix4 object
multiplies a column with a polynomial and outputs the transformed column. The
plmix4 object has 4 plmix objects. The input for plmix4 is routed to each of these
objects by shifting them one at a time. The plmix object has four plmult objects that
multiply the coefficients. The outputs of the plmult objects are XORed to produce the
desired result. The outputs of the two iterations done on these stages are registered using
RegEn objects. The N value of the For loop which is used for iterations is used to
enable the appropriate set of registers. The Key Schedule is done in parallel to this
operation. The object corresponding to this is plkeyschedule. It uses four sbox
objects for obtaining the substituted values. Once the iterations are finished, the
registered values and the output of plkeyschedule are passed to the roundkey
object to complete the round transformation. There are seventy-six lookup
tables in total in this round.
Fig 19: Design of round in Encryption
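The figure of seventy-six lookup tables can be checked with a short tally. The sketch below is an inference from the object hierarchy described above and not a count taken from the Viva design itself: it assumes that each plmult object holds two tables (an ltable and an atable, as the multiply object does) and that no other tables are counted. All constant names are hypothetical.

```python
SBOX_IN_PLSBOX8 = 8        # eight sbox objects in plsbox8
PLMIX4_OBJECTS = 2         # two plmix4 objects follow plsbox8
PLMIX_PER_PLMIX4 = 4       # each plmix4 holds four plmix objects
PLMULT_PER_PLMIX = 4       # each plmix holds four plmult objects
TABLES_PER_PLMULT = 2      # assumption: one ltable and one atable each
SBOX_IN_PLKEYSCHEDULE = 4  # four sbox objects in plkeyschedule


def lookup_tables_per_round():
    """Tally the lookup tables instantiated in one round."""
    mix_tables = (PLMIX4_OBJECTS * PLMIX_PER_PLMIX4
                  * PLMULT_PER_PLMIX * TABLES_PER_PLMULT)
    return SBOX_IN_PLSBOX8 + mix_tables + SBOX_IN_PLKEYSCHEDULE
```

Under these assumptions the MixColumn stage alone accounts for 64 of the 76 tables, which is consistent with the later observation that most of the chip was consumed by that stage.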
Implementing the multiplication in arithmetic:
Since Viva was not able to synthesize the design with many lookup tables, the
implementation was changed by replacing the lookup tables with the actual arithmetic.
Actually, as per the algorithm, we are not required to implement the whole multiplication
in the arithmetic. Since the polynomial used in Encryption and Decryption is a constant,
two objects were designed that multiply a column of the State with the polynomial used
in Encryption and Decryption.
The multiplication is done in the Galois Field GF(2^8). In polynomial representation, the
multiplication corresponds to a product of polynomials modulo an irreducible binary
polynomial of degree 8. The polynomial used in the algorithm is x^8 + x^4 + x^3 + x + 1, which
can be represented in hexadecimal notation as ‘11B’.
Multiplication by the polynomial x, which can be represented in hexadecimal notation as
‘02’, is a left shift followed by a conditional XOR. If the left shift results in a carry, then
the result of the shift is XORed with ‘1B’. The polynomial used in Encryption has
coefficients ‘03’, ‘01’, ‘01’ and ‘02’. Multiplication with ‘02’ is done as explained above.
Multiplication with ‘01’ is the number itself. Multiplication with ‘03’ is split into
multiplication with ‘02’ plus multiplication with ‘01’. The addition is again an XOR. The
polynomial used in Decryption has the coefficients ‘09’, ‘0B’, ‘0D’, and ‘0E’. All these
are also split in terms of powers of two and XORed at the end. For example,
multiplication with ‘09’ is split into multiplication by ‘08’ XORed with multiplication by
‘01’. Multiplication by ‘08’ is achieved by three successive multiplications by ‘02’. Since
all the coefficients are multiplied in parallel and XORed, the maximum number of shifts
done in succession is equal to three in Decryption and one in Encryption. This eliminates
a number of lookup tables, thus reducing the chip resources used.
The left shift in Viva is implemented using the RCL objects available in the corelib
library. The carryover is fed as an input to the Mux to do the conditional XOR. The
irreducible polynomial with which the result of the shift is XORed is given as a constant.
The object is named mulbyx.
Fig 20: Multiplication by x or ‘02’
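The shift-and-conditional-XOR scheme, and the way the constant coefficients are built from it, can be modelled in Python as below. This is only an arithmetic sketch: mulbyx mirrors the object of the same name, while the helpers mul03, mul09, mul0b, mul0d and mul0e are hypothetical names introduced here for illustration.

```python
def mulbyx(value):
    """Multiply a byte by x ('02'): left shift, then XOR with 1B on carry."""
    shifted = value << 1
    carry = shifted >> 8          # the bit shifted out of the byte
    shifted &= 0xFF
    return shifted ^ 0x1B if carry else shifted


def mul03(value):
    """'03' = '02' + '01'; addition in the field is an XOR."""
    return mulbyx(value) ^ value


def mul09(value):
    """'09' = '08' + '01'; '08' is three successive multiplications by '02'."""
    return mulbyx(mulbyx(mulbyx(value))) ^ value


def mul0b(value):
    """'0B' = '08' + '02' + '01'."""
    x2 = mulbyx(value)
    x8 = mulbyx(mulbyx(x2))
    return x8 ^ x2 ^ value


def mul0d(value):
    """'0D' = '08' + '04' + '01'."""
    x4 = mulbyx(mulbyx(value))
    x8 = mulbyx(x4)
    return x8 ^ x4 ^ value


def mul0e(value):
    """'0E' = '08' + '04' + '02'."""
    x2 = mulbyx(value)
    x4 = mulbyx(x2)
    x8 = mulbyx(x4)
    return x8 ^ x4 ^ x2
```

Each Decryption helper needs at most three chained calls to mulbyx, and mul03 needs only one, which matches the shift depths of three and one stated above.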
The object cmmix is used to multiply a column with the polynomial to produce one
coefficient of the result. The multiplication with the polynomial in Encryption is
implemented as follows.
Fig 21: cmmix object
The complete multiplication of the polynomial with the column of the State is
implemented by shifting the column and passing it as an input to the cmmix object. The
object corresponding to that is the cmmix4 object. The MixColumn transformation is
accomplished by using four cmmix4 objects in parallel.
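As an illustration of how the column is rotated through cmmix, the following Python sketch models the Encryption MixColumn transformation with arithmetic only. It is a behavioural model under stated assumptions: the operand order inside cmmix (coefficients ‘02’, ‘03’, ‘01’, ‘01’ applied to the rotated column) follows the standard definition of the transformation, and the internal wiring of the actual Viva object may differ. The name mix_columns is hypothetical.

```python
def mulbyx(value):
    """Multiply a byte by x ('02') in GF(2^8) modulo 0x11B."""
    shifted = (value << 1) & 0xFF
    return shifted ^ 0x1B if value & 0x80 else shifted


def cmmix(a0, a1, a2, a3):
    """Produce one output byte: 02*a0 XOR 03*a1 XOR a2 XOR a3."""
    return mulbyx(a0) ^ (mulbyx(a1) ^ a1) ^ a2 ^ a3


def cmmix4(column):
    """Transform one column by feeding four rotations of it to cmmix."""
    return [cmmix(*(column[i:] + column[:i])) for i in range(4)]


def mix_columns(state):
    """Apply cmmix4 to each of the four columns of a 16-byte State.

    Assumed layout: bytes 4*c .. 4*c+3 form column c.
    """
    result = []
    for col in range(4):
        result.extend(cmmix4(list(state[4 * col:4 * col + 4])))
    return result
```

In hardware the four calls inside mix_columns correspond to four cmmix4 objects working in parallel, and the rotations are again only wiring.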
Since the use of arithmetic to do the MixColumn transformation reduces the silicon
usage, a full-fledged parallel implementation can be done in the round. Previously, in
case of lookup tables, both MixColumn and ByteSub stages were iterated once in
order to make two-and-one-half rounds fit on a single chip. But in that case much of the
chip is used in the MixColumn stage due to the excessive usage of lookup tables. When
these tables are eliminated, a round needs only a little more than a tenth of a chip in case
of Encryption when implemented with no iteration.
Fig 22: Design of round in Encryption
Fig 23: Design of round in Decryption
Due to enormous synthesis times, however, the whole algorithm could not be synthesized
onto one chip. Therefore, we attempted to use two chips by placing five rounds on each
chip. Although the synthesis completed, the design did not produce correct output.
Debugging was difficult as the synthesis time was about two days.
Viva 2.3:
The initial problem with Viva 2.3 was that it did not handle files for constants. The initial
work-around proposed by Star Bridge would have required relabelling all the input horns.
Since there were 256 such horns in our initial design, this was viewed as an unacceptable
“solution.” We therefore decided to import an EDIF file generated by a VHDL
implementation. A single lookup table done in this manner took 72 slices, compared to
the 160 slices taken previously by a Viva object. Given that we had 200 lookup tables in
the entire implementation, the silicon usage was reduced by 17,600 slices, and as a result
the whole algorithm synthesized into less than half of one chip.
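The saving quoted above follows directly from the per-table figures. The short Python check below uses only numbers stated in this thesis (200 tables, 160 and 72 slices per table, and a chip of 33,792 slices, that is, two more than the 33,790-slice design reported in the Throughput section); the function names are hypothetical.

```python
def slices_saved(tables=200, viva_slices=160, vhdl_slices=72):
    """Slices saved by replacing every Viva lookup table with a VHDL one."""
    return tables * (viva_slices - vhdl_slices)


def table_share_of_chip(tables=200, slices_per_table=72, chip_slices=33792):
    """Fraction of one chip occupied by the lookup tables alone."""
    return tables * slices_per_table / chip_slices
```

The tables alone drop from 32,000 slices, nearly a whole chip, to 14,400 slices, a little over 40% of a chip.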
The implementation of the lookup table in VHDL is done using an array. The EDIF file is
generated using the fc2 compiler. This EDIF file is ported into Viva using a script written
by Heather A. Wake [25]. There are some problems with the EDIFs generated using
Synopsys, but these problems did not appear in this particular use of the Synopsys tool.
Chapter 5: Results and Conclusions
Results on VHDL:
We used ModelSim [15] to simulate the algorithm and the Xilinx ISE tools [26] to
synthesize the code. The results for independent blocks are tabulated below. The
synthesis has been done for a Virtex2 device xc2v6000, package ff1152, speed -4.
The par statistics are generated by the Xilinx tools.
The architectures A1 to A4 were done with iterations inside the stages. The A5 iteration
was actually aimed at implementing the architecture used in VHDL to compare the
resource usage and timing. But since the synthesis tool in Viva is not as efficient as
standard synthesis tools, the algorithm cannot be implemented without iterations. Worse
yet, we could not complete the full algorithm in the architecture, since Viva failed to
synthesize more than one round on a single chip (even though one round takes much less
than half a chip).
Architecture  Slices  Clock cycles  Comments
A1            2285    4069          Works on Viva 2.2 but not on Viva 2.3
A2            4656    4480          Works on both Viva 2.2 and 2.3
A3            16393   4056          Works on Viva 2.2 but not on Viva 2.3
A4            14395   4077          Works on both Viva 2.2 and 2.3
A5            ---     ---           Only one iteration works on Viva 2.2
A6            15470   1             Does not synthesize in Viva 2.2; synthesized in Viva 2.3 after replacing the Viva lookup tables with VHDL lookup tables
A7            18653   1             Does not synthesize in Viva 2.2; synthesized in Viva 2.3 after replacing the Viva lookup tables with VHDL lookup tables
A8            ---     ---           Synthesizes on Viva 2.2 but does not give correct results. Not required on Viva 2.3
Table 7: Results of various architectures in Viva 2.2 and Viva 2.3
The architecture A8 was implemented when A6 and A7 failed to synthesize in Viva 2.2.
Considering the fact that a single round of this architecture took around 10% of a chip,
the whole algorithm might be synthesizable on a single chip if we consider the overhead
for input and output. For the Viva 2.3 implementation, the lookup tables were replaced by
VHDL modules. Since the architectures A6 and A7 synthesized on Viva 2.3, the
architecture A8 was not tested on Viva 2.3.
It has been a source of great frustration that we have not been able to test Viva on a
reasonable full AES design. Based on the synthesis of parts of AES using Viva and on
the synthesis of part and all of AES using standard synthesis tools, there should be no
fundamental obstacle to a complete AES implementation on the HC 36m. However, the
use of Viva to implement AES in its entirety will have to wait for a later and corrected
version of the software.
Throughput:
The problem with calculating the throughput of all the architectures on the HC 36m is the
inability of Viva to support what Star Bridge Systems refers to as FILE I/O, the transfer
of data from and to files on the host through the HC 36m hardware. Also, the hardware is
presently limited to a very slow speed due to the use of a rather primitive core doing the
communication on the PCIX bus.
But if we consider the core itself as we have implemented it, rather than considering the
limitations of the machine on which it is implemented, we would achieve a significant increase
in throughput in Non-iterative architectures over the basic iterative architectures.
Architecture Throughput (Gbps) Frequency of the clock(MHz)
A1 0.0012 40
A2 0.0011 40
A6 8.5334 66
A7 8.5334 66
Table 8: Throughput for different architectures
The throughputs listed in the table for architectures A6 and A7 do not reflect their actual
speeds since the HC 36m cannot be run faster than 66 MHz. In order to get an estimate of
actual throughput, we decided to run both A6 and A7 on a single chip routing the output
of A6 to A7 without any intermediate registers. The design took 33,790 slices, two fewer
than the total slices available on a single chip, and the design ran at a 15 ns clock.
Theoretically, then, both A6 and A7 should have no more than an 8 ns delay. Based on
this, the throughput of A6 and A7 can be estimated at 16 Gbps at a 125 MHz clock
frequency.
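The throughput figures in Table 8 and the 16 Gbps estimate follow from the block size, the clock rate and the number of clock cycles per block. The Python sketch below restates that arithmetic. The function name is hypothetical, and the 66 MHz entries are taken to correspond to the 15 ns clock period quoted above (about 66.67 MHz), which is an assumption that reproduces the tabulated 8.5334 Gbps to within rounding.

```python
def throughput_gbps(clock_hz, cycles_per_block, block_bits=128):
    """Throughput in Gbps for one 128-bit block every cycles_per_block cycles."""
    return block_bits * clock_hz / cycles_per_block / 1e9


# Iterative architectures at 40 MHz (clock cycles from Table 7).
a1 = throughput_gbps(40e6, 4069)        # about 0.00126 Gbps
a2 = throughput_gbps(40e6, 4480)        # about 0.00114 Gbps

# Non-iterative architectures: one block per clock cycle.
a6_at_15ns = throughput_gbps(1 / 15e-9, 1)   # about 8.533 Gbps
a6_at_8ns = throughput_gbps(125e6, 1)        # 16 Gbps at a 125 MHz clock
```

The gap of nearly four orders of magnitude between A1 and A6 comes almost entirely from the cycle count (4069 cycles per block against one), not from the clock rate.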
Comparisons:
In any demonstration of technology, it is necessary to compare new results against those
already achieved by others. Listed below are some of the other commercial and
academic implementations of AES done on Virtex chips in a non-iterative approach.
Design Device Throughput Slices BRAMs Frequency
P. Chodowiec et al [2] Virtex XCV1000 -6 12.16 12600 80 95
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

package int_types is
  subtype large_int is integer range 0 to 255;
end package;

port(
  sa1,sa2,sa3,sa4,sa5,sa6,sa7,sa8,sa9,sa10,sa11,sa12,sa13,sa14,sa15,sa16 : in large_int;
  sb1,sb2,sb3,sb4,sb5,sb6,sb7,sb8,sb9,sb10,sb11,sb12,sb13,sb14,sb15,sb16 : out large_int);
end entity;

architecture behav of shiftrow is
begin
storage: process(sa1,sa2,sa3,sa4,sa5,sa6,sa7,sa8,sa9,sa10,sa11,sa12,sa13,sa14,sa15,sa16)

entity round_roundkey is
port(
  aa1,aa2,aa3,aa4,aa5,aa6,aa7,aa8,aa9,aa10,aa11,aa12,aa13,aa14,aa15,aa16,
  k1,k2,k3,k4,k5,k6,k7,k8,k9,k10,k11,k12,k13,k14,k15,k16 : in large_int;
  b1,b2,b3,b4,b5,b6,b7,b8,b9,b10,b11,b12,b13,b14,b15,b16 : out large_int);
end entity;

entity round_sbox is
port(
  aa1,aa2,aa3,aa4,aa5,aa6,aa7,aa8,aa9,aa10,aa11,aa12,aa13,aa14,aa15,aa16 : in large_int;
  b1,b2,b3,b4,b5,b6,b7,b8,b9,b10,b11,b12,b13,b14,b15,b16 : out large_int);
end entity;

begin
block1: mix port map(ma1,ma2,ma3,ma4,pc1,pc2,pc3,pc4,mb1);
block2: mix port map(ma2,ma3,ma4,ma1,pc1,pc2,pc3,pc4,mb2);
block3: mix port map(ma3,ma4,ma1,ma2,pc1,pc2,pc3,pc4,mb3);
block4: mix port map(ma4,ma1,ma2,ma3,pc1,pc2,pc3,pc4,mb4);
block5: mix port map(ma5,ma6,ma7,ma8,pc1,pc2,pc3,pc4,mb5);
block6: mix port map(ma6,ma7,ma8,ma5,pc1,pc2,pc3,pc4,mb6);
block7: mix port map(ma7,ma8,ma5,ma6,pc1,pc2,pc3,pc4,mb7);
block8: mix port map(ma8,ma5,ma6,ma7,pc1,pc2,pc3,pc4,mb8);
block9: mix port map(ma9,ma10,ma11,ma12,pc1,pc2,pc3,pc4,mb9);
block10: mix port map(ma10,ma11,ma12,ma9,pc1,pc2,pc3,pc4,mb10);
block11: mix port map(ma11,ma12,ma9,ma10,pc1,pc2,pc3,pc4,mb11);
block12: mix port map(ma12,ma9,ma10,ma11,pc1,pc2,pc3,pc4,mb12);
block13: mix port map(ma13,ma14,ma15,ma16,pc1,pc2,pc3,pc4,mb13);
block14: mix port map(ma14,ma15,ma16,ma13,pc1,pc2,pc3,pc4,mb14);
block15: mix port map(ma15,ma16,ma13,ma14,pc1,pc2,pc3,pc4,mb15);
block16: mix port map(ma16,ma13,ma14,ma15,pc1,pc2,pc3,pc4,mb16);
end architecture;

entity lround is
port(
  aa1,aa2,aa3,aa4,aa5,aa6,aa7,aa8,aa9,aa10,aa11,aa12,aa13,aa14,aa15,aa16,
  k1,k2,k3,k4,k5,k6,k7,k8,k9,k10,k11,k12,k13,k14,k15,k16 : in large_int;
  b1,b2,b3,b4,b5,b6,b7,b8,b9,b10,b11,b12,b13,b14,b15,b16 : out large_int);
end entity;

architecture struct of lround is
signal ap1,ap2,ap3,ap4,ap5,ap6,ap7,ap8,ap9,ap10,ap11,ap12,ap13,ap14,ap15,ap16 : large_int;
signal tap1,tap2,tap3,tap4,tap5,tap6,tap7,tap8,tap9,tap10,tap11,tap12,tap13,tap14,tap15,tap16 : large_int;
signal ttap1,ttap2,ttap3,ttap4,ttap5,ttap6,ttap7,ttap8,ttap9,ttap10,ttap11,ttap12,ttap13,ttap14,ttap15,ttap16 : large_int;

port(
  sa1,sa2,sa3,sa4,sa5,sa6,sa7,sa8,sa9,sa10,sa11,sa12,sa13,sa14,sa15,sa16 : in large_int;
  sb1,sb2,sb3,sb4,sb5,sb6,sb7,sb8,sb9,sb10,sb11,sb12,sb13,sb14,sb15,sb16 : out large_int);
end component;

entity round is
port(
  aa1,aa2,aa3,aa4,aa5,aa6,aa7,aa8,aa9,aa10,aa11,aa12,aa13,aa14,aa15,aa16,
  k1,k2,k3,k4,k5,k6,k7,k8,k9,k10,k11,k12,k13,k14,k15,k16,
  pc1,pc2,pc3,pc4 : in large_int;
  b1,b2,b3,b4,b5,b6,b7,b8,b9,b10,b11,b12,b13,b14,b15,b16 : out large_int);
end entity;
architecture struct of round is

port(
  sa1,sa2,sa3,sa4,sa5,sa6,sa7,sa8,sa9,sa10,sa11,sa12,sa13,sa14,sa15,sa16 : in large_int;
  sb1,sb2,sb3,sb4,sb5,sb6,sb7,sb8,sb9,sb10,sb11,sb12,sb13,sb14,sb15,sb16 : out large_int);
end component;

entity keyschedule is
port(
  a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16,RCpointer : in large_int;
  owk : out keyarray;
  RCp : out large_int);
end entity;

architecture behav of keyschedule is
component RAM_rbox is
port( ma : in large_int; mb : out large_int);
end component;
component RAM_sbox is
port( ma : in large_int; mb : out large_int);
end component;
component roundkey
port( a,b : in large_int; c : out large_int);
end component;
signal aa1,aa2,aa3,aa4,aa5,aa6,aa7,aa8,
       out1,out2,out3,out4,out5,out6,out7,out8,out9,out10,out11,out12,out13,out14,out15,out0 : large_int;

begin
block1: RAM_sbox port map(a14,aa1);
block2: roundkey port map(a1,aa1,aa2);
block3: RAM_sbox port map(a15,aa3);
block4: roundkey port map(a2,aa3,out1);
block5: RAM_sbox port map(a16,aa4);
block6: roundkey port map(aa4,a3,out2);
block7: RAM_sbox port map(a13,aa5);
block8: roundkey port map(aa5,a4,out3);
block9: RAM_rbox port map(RCpointer,aa6);
block10: roundkey port map(aa6,aa2,out0);
block11: roundkey port map(a5,out0,out4);
block12: roundkey port map(a6,out1,out5);
block13: roundkey port map(a7,out2,out6);
block14: roundkey port map(a8,out3,out7);
block15: roundkey port map(a9,out4,out8);
block16: roundkey port map(a10,out5,out9);
block17: roundkey port map(a11,out6,out10);
block18: roundkey port map(a12,out7,out11);
block19: roundkey port map(a13,out8,out12);
block20: roundkey port map(a14,out9,out13);
block21: roundkey port map(a15,out10,out14);
block22: roundkey port map(a16,out11,out15);
process(RCpointer,out1,out2,out3,out4,out5,out6,out7,out8,out9,out10,out11,out12,out13,out14,out15,out0)
begin
  RCp <= RCpointer + 1;
  owk(0) <= out0;
  owk(1) <= out1;
  owk(2) <= out2;
  owk(3) <= out3;
  owk(4) <= out4;

entity aes is
port(
  aa1,aa2,aa3,aa4,aa5,aa6,aa7,aa8,aa9,aa10,aa11,aa12,aa13,aa14,aa15,aa16,
  k1,k2,k3,k4,k5,k6,k7,k8,k9,k10,k11,k12,k13,k14,k15,k16,
  RCpointer,pc1,pc2,pc3,pc4 : in large_int;
  b1,b2,b3,b4,b5,b6,b7,b8,b9,b10,b11,b12,b13,b14,b15,b16 : out large_int);
end entity;

architecture struct of aes is
signal ap1,ap2,ap3,ap4,ap5,ap6,ap7,ap8,ap9,ap10,ap11,ap12,ap13,ap14,ap15,ap16 : large_int;
signal tap1,tap2,tap3,tap4,tap5,tap6,tap7,tap8,tap9,tap10,tap11,tap12,tap13,tap14,tap15,tap16 : large_int;
port map(wpk9(0),wpk9(1),wpk9(2),wpk9(3),wpk9(4),wpk9(5),wpk9(6),wpk9(7),wpk9(8),wpk9(9),wpk9(10),wpk9(11),wpk9(12),wpk9(13),wpk9(14),wpk9(15),RC10,wpk10,RC11);
port(
  sa1,sa2,sa3,sa4,sa5,sa6,sa7,sa8,sa9,sa10,sa11,sa12,sa13,sa14,sa15,sa16 : in large_int;
  sb1,sb2,sb3,sb4,sb5,sb6,sb7,sb8,sb9,sb10,sb11,sb12,sb13,sb14,sb15,sb16 : out large_int);
end entity;
architecture behav of dshiftrow is
begin
storage: process(sa1,sa2,sa3,sa4,sa5,sa6,sa7,sa8,sa9,sa10,sa11,sa12,sa13,sa14,sa15,sa16)

entity round_dsbox is
port(
  aa1,aa2,aa3,aa4,aa5,aa6,aa7,aa8,aa9,aa10,aa11,aa12,aa13,aa14,aa15,aa16 : in large_int;
  b1,b2,b3,b4,b5,b6,b7,b8,b9,b10,b11,b12,b13,b14,b15,b16 : out large_int);
end entity;
architecture struct of round_dsbox is

entity dmix is
port( ma1,ma2,ma3,ma4,pc1,pc2,pc3,pc4 : in large_int; mb1 : out large_int);
end entity;
architecture behav of dmix is

component multiply
port( ma1,ma2 : in large_int; mb1 : out large_int);
end component;

component roundkey
port( a,b : in large_int; c : out large_int);
end component;

signal aa1,aa2,aa3,aa4,aa5,aa6 : large_int;
begin

block1: multiply port map(ma1,pc1,aa1);
block2: multiply port map(ma2,pc2,aa2);
block3: roundkey port map(aa1,aa2,aa3);
block4: multiply port map(ma3,pc3,aa4);
block5: multiply port map(ma4,pc4,aa5);
block6: roundkey port map(aa4,aa5,aa6);
block7: roundkey port map(aa6,aa3,mb1);
end architecture;

begin
block1: dmix port map(ma1,ma2,ma3,ma4,pc1,pc2,pc3,pc4,mb1);
block2: dmix port map(ma2,ma3,ma4,ma1,pc1,pc2,pc3,pc4,mb2);
block3: dmix port map(ma3,ma4,ma1,ma2,pc1,pc2,pc3,pc4,mb3);
block4: dmix port map(ma4,ma1,ma2,ma3,pc1,pc2,pc3,pc4,mb4);
block5: dmix port map(ma5,ma6,ma7,ma8,pc1,pc2,pc3,pc4,mb5);
block6: dmix port map(ma6,ma7,ma8,ma5,pc1,pc2,pc3,pc4,mb6);
block7: dmix port map(ma7,ma8,ma5,ma6,pc1,pc2,pc3,pc4,mb7);
block8: dmix port map(ma8,ma5,ma6,ma7,pc1,pc2,pc3,pc4,mb8);
block9: dmix port map(ma9,ma10,ma11,ma12,pc1,pc2,pc3,pc4,mb9);
block10: dmix port map(ma10,ma11,ma12,ma9,pc1,pc2,pc3,pc4,mb10);
block11: dmix port map(ma11,ma12,ma9,ma10,pc1,pc2,pc3,pc4,mb11);
block12: dmix port map(ma12,ma9,ma10,ma11,pc1,pc2,pc3,pc4,mb12);
block13: dmix port map(ma13,ma14,ma15,ma16,pc1,pc2,pc3,pc4,mb13);
block14: dmix port map(ma14,ma15,ma16,ma13,pc1,pc2,pc3,pc4,mb14);
block15: dmix port map(ma15,ma16,ma13,ma14,pc1,pc2,pc3,pc4,mb15);
block16: dmix port map(ma16,ma13,ma14,ma15,pc1,pc2,pc3,pc4,mb16);
end architecture;

entity fround is
port(
  aa1,aa2,aa3,aa4,aa5,aa6,aa7,aa8,aa9,aa10,aa11,aa12,aa13,aa14,aa15,aa16,
  k1,k2,k3,k4,k5,k6,k7,k8,k9,k10,k11,k12,k13,k14,k15,k16
  : in large_int;
  b1,b2,b3,b4,b5,b6,b7,b8,b9,b10,b11,b12,b13,b14,b15,b16 : out large_int);
end entity;
architecture struct of fround is

signal tap1,tap2,tap3,tap4,tap5,tap6,tap7,tap8,tap9,tap10,tap11,tap12,tap13,tap14,tap15,tap16 : large_int;
signal ttap1,ttap2,ttap3,ttap4,ttap5,ttap6,ttap7,ttap8,ttap9,ttap10,ttap11,ttap12,ttap13,ttap14,ttap15,ttap16 : large_int;
signal tttap1,tttap2,tttap3,tttap4,tttap5,tttap6,tttap7,tttap8,tttap9,tttap10,tttap11,tttap12,tttap13,tttap14,tttap15,tttap16 : large_int;

port(
  sa1,sa2,sa3,sa4,sa5,sa6,sa7,sa8,sa9,sa10,sa11,sa12,sa13,sa14,sa15,sa16 : in large_int;
  sb1,sb2,sb3,sb4,sb5,sb6,sb7,sb8,sb9,sb10,sb11,sb12,sb13,sb14,sb15,sb16 : out large_int);
end component;

entity dround is
port(
  aa1,aa2,aa3,aa4,aa5,aa6,aa7,aa8,aa9,aa10,aa11,aa12,aa13,aa14,aa15,aa16,
  k1,k2,k3,k4,k5,k6,k7,k8,k9,k10,k11,k12,k13,k14,k15,k16,
  pc1,pc2,pc3,pc4 : in large_int;
  b1,b2,b3,b4,b5,b6,b7,b8,b9,b10,b11,b12,b13,b14,b15,b16 : out large_int);
end entity;
architecture struct of dround is

signal tap1,tap2,tap3,tap4,tap5,tap6,tap7,tap8,tap9,tap10,tap11,tap12,tap13,tap14,tap15,tap16 : large_int;
signal ttap1,ttap2,ttap3,ttap4,ttap5,ttap6,ttap7,ttap8,ttap9,ttap10,ttap11,ttap12,ttap13,ttap14,ttap15,ttap16 : large_int;
signal tttap1,tttap2,tttap3,tttap4,tttap5,tttap6,tttap7,tttap8,tttap9,tttap10,tttap11,tttap12,tttap13,tttap14,tttap15,tttap16 : large_int;

port(
  sa1,sa2,sa3,sa4,sa5,sa6,sa7,sa8,sa9,sa10,sa11,sa12,sa13,sa14,sa15,sa16 : in large_int;
  sb1,sb2,sb3,sb4,sb5,sb6,sb7,sb8,sb9,sb10,sb11,sb12,sb13,sb14,sb15,sb16 : out large_int);
end component;

entity daes is
port(
  aa1,aa2,aa3,aa4,aa5,aa6,aa7,aa8,aa9,aa10,aa11,aa12,aa13,aa14,aa15,aa16,
  k1,k2,k3,k4,k5,k6,k7,k8,k9,k10,k11,k12,k13,k14,k15,k16,
  RCpointer,pc1,pc2,pc3,pc4 : in large_int;
  b1,b2,b3,b4,b5,b6,b7,b8,b9,b10,b11,b12,b13,b14,b15,b16 : out large_int);
end entity;
architecture struct of daes is
signal ap1,ap2,ap3,ap4,ap5,ap6,ap7,ap8,ap9,ap10,ap11,ap12,ap13,ap14,ap15,ap16 : large_int;
signal tap1,tap2,tap3,tap4,tap5,tap6,tap7,tap8,tap9,tap10,tap11,tap12,tap13,tap14,tap15,tap16 : large_int;
signal sap1,sap2,sap3,sap4,sap5,sap6,sap7,sap8,sap9,sap10,sap11,sap12,sap13,sap14,sap15,sap16 : large_int;
signal aap1,aap2,aap3,aap4,aap5,aap6,aap7,aap8,aap9,aap10,aap11,aap12,aap13,aap14,aap15,aap16 : large_int;
signal bap1,bap2,bap3,bap4,bap5,bap6,bap7,bap8,bap9,bap10,bap11,bap12,bap13,bap14,bap15,bap16 : large_int;
signal cap1,cap2,cap3,cap4,cap5,cap6,cap7,cap8,cap9,cap10,cap11,cap12,cap13,cap14,cap15,cap16 : large_int;
signal dap1,dap2,dap3,dap4,dap5,dap6,dap7,dap8,dap9,dap10,dap11,dap12,dap13,dap14,dap15,dap16 : large_int;
signal fap1,fap2,fap3,fap4,fap5,fap6,fap7,fap8,fap9,fap10,fap11,fap12,fap13,fap14,fap15,fap16 : large_int;
signal gap1,gap2,gap3,gap4,gap5,gap6,gap7,gap8,gap9,gap10,gap11,gap12,gap13,gap14,gap15,gap16 : large_int;
signal hap1,hap2,hap3,hap4,hap5,hap6,hap7,hap8,hap9,hap10,hap11,hap12,hap13,hap14,hap15,hap16 : large_int;
signal wpk1,wpk2,wpk3,wpk4,wpk5,wpk6,wpk7,wpk8,wpk9,wpk10 : keyarray;
signal RC2,RC3,RC4,RC5,RC6,RC7,RC8,RC9,RC10,RC11 : large_int;

component keyschedule
port(
  a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16,RCpointer : in large_int;
  owk : out keyarray;
  RCp : out large_int);
end component;