UNIVERSITY OF CALIFORNIA, IRVINE The Advanced Encryption Standard Mapping into MorphoSys Architecture THESIS submitted in partial satisfaction of the requirements for the degree of MASTER OF SCIENCE in Electrical and Computer Engineering by Ye Tang Thesis Committee: Professor Nader Bagherzadeh, Chair Professor Fadi J. Kurdahi Professor Stephen F. Jenks 2001
At the third or global level, there are buses that support inter-quadrant
connectivity (see Figure 1.5). These buses are also called express lanes, and they run
across rows as well as columns. A lane can supply data from any RC of a quadrant
to the four RCs in the same row/column of the adjacent quadrant. For example, the value
of RC(0,1)* can be put on the horizontal express lane (HE) and then read by RC(0,4),
RC(0,5), RC(0,6), and RC(0,7); or it can be put on the vertical express lane (VE) and then
read by RC(4,1), RC(5,1), RC(6,1), and RC(7,1). Thus, up to four cells in a row/column
may access the output value of any one of the four cells in the same row/column of the
adjacent quadrant. Express lanes greatly enhance global connectivity. Some irregular
communication patterns that would otherwise require extensive interconnections can be
handled quite efficiently. For example, an eight-point butterfly in FFT is accomplished in
only three clock cycles, and the data movement in the AES implementation
depends largely on the express lanes.
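The express-lane reach described above can be sketched in a few lines of Python. This is only an illustration: the function name `express_lane_readers` and the constant `QUADRANT` are hypothetical, and the sketch assumes an 8x8 RC Array split into 4x4 quadrants, with lanes working symmetrically in both directions.

```python
QUADRANT = 4  # assumption: the 8x8 RC Array is split into 4x4 quadrants


def express_lane_readers(row, col, lane):
    """Return the RCs that can read the value RC(row, col) puts on a lane.

    lane is "HE" (horizontal express lane) or "VE" (vertical express lane).
    """
    if lane == "HE":
        # Same row, the four columns of the horizontally adjacent quadrant.
        start = QUADRANT if col < QUADRANT else 0
        return [(row, c) for c in range(start, start + QUADRANT)]
    if lane == "VE":
        # Same column, the four rows of the vertically adjacent quadrant.
        start = QUADRANT if row < QUADRANT else 0
        return [(r, col) for r in range(start, start + QUADRANT)]
    raise ValueError("lane must be 'HE' or 'VE'")


print(express_lane_readers(0, 1, "HE"))
print(express_lane_readers(0, 1, "VE"))
```

For RC(0,1) the two calls list the same cells as the example in the text.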
Figure 1.5: Level 3 of interconnection network (8x8 grid of RCs)
* RC(x,y) means the RC located at row x, column y.
The L, M, R, T, C, B ports of MUXA: The L (Left), M (Middle), R (Right), T (Top),
C (Center), and B (Bottom) ports of MUXA are all connected to other RCs within the
same quadrant. For example, these ports for RC X and Y are marked in Figure 1.6.
Notice that the port names do not always match their literal meanings.
Figure 1.6: L, M, R, T, C, B Port of MUXA
The L, U, D ports of MUXB: The L (Left), U (Up), and D (Down) ports of MUXB are
defined by absolute location. They are not necessarily confined to a quadrant. For
example, these ports for RC X, Y, and Z are marked in Figure 1.7. Notice that the
connections wrap around the edges of the array.
Figure 1.7: L, U, D Port of MUXB
1.2.3 Frame Buffer and DMA Controller
The high parallelism of the RC Array would be ineffective if the memory
interface were unable to transfer data at an adequate rate. Therefore, a high-speed memory
interface consisting of a streaming buffer (Frame Buffer) and a DMA controller is
incorporated in the system. The Frame Buffer has two sets, as illustrated in Figure 1.8.
The communication between Frame Buffer and main memory is controlled by the DMA
controller. By using the two sets of Frame Buffer alternately, the computation of RC
Array and the data load and store of Frame Buffer are overlapped. Therefore, the memory
accesses are virtually transparent to RC Array.
Figure 1.8: Frame Buffer Block Diagram (labels: Set 0, Set 1, Bank A (64 x 8 bytes), Bank B (64 x 8 bytes), MSB, LSB)
1.2.4 Context Memory
The context memory stores configuration data, or contexts, for RC Array.
Contexts resemble the instructions for a microprocessor. But here, every context can
serve eight RCs in the same row or column simultaneously*.
As shown in Figure 1.9, Context Memory is logically organized into two blocks,
column context block (on the top) and row context block (on the left). Each block
consists of eight context sets, and each set consists of 16 context words.
A context word in the row context block (called a row context word) is broadcast
on a row, and a context word in the column context block (called a column context word)
is broadcast on a column. By picking one corresponding word from each set in the
row/column context block, those 8 words (a plane) can cover all 8 rows/columns,
or the 64 RCs.
* That also indicates the coarse-grain nature (word-level operations) of MorphoSys architecture.
Figure 1.9: Structure of Context Memory (column context block on the top, row context block on the left, 16 context words per set)
The total number of row/column contexts is referred to as the depth of
programmability. Because there are 16 words in a set, there are 16 row contexts and 16
column contexts in total. This means the depth of programmability is 32. In other words,
RC Array can perform up to 32 different operations without reloading new contexts.
This depth is enough for many DSP and image processing applications.
However, it is not enough for some complicated algorithms. Because the penalty for
reloading new contexts during an application is large, a reasonable solution is to increase the
context memory size. In M2, the next version of MorphoSys, the depth will be increased to
256.
1.2.5 TinyRISC
Figure 1.10 shows the block diagram of TinyRISC. Since most target applications
involve some sequential processing, a RISC processor, TinyRISC [6], is included in the
system.
Figure 1.10: TinyRISC block diagram (Fetch, Decode, Execute, and Write-Back stages; Program Counter, Branch Unit, ALU, Shift Unit, Memory Unit, MorphoSys Unit, Register File, Data Cache Core, Clock Driver)
This is a MIPS-like processor with a 4-stage scalar pipeline. It has a 32-bit ALU,
a register file, and an on-chip data cache memory. This processor also coordinates system
operation and controls its interface with the external world. This is made possible by
some specific instructions added (besides the standard RISC instructions) to the TinyRISC
Instruction Set Architecture (ISA). These instructions are called MorphoSys instructions.
They can initiate data transfers between main memory and MorphoSys components, and
control the execution of the RC Array.
These MorphoSys instructions are listed in Table 1.2. There are two major
categories of these instructions: DMA instructions and RC Array instructions.
Table 1.2: MorphoSys Instructions
Mnemonic   Description of Operation
LDCTXT     Load Context from Main Memory into Context Memory.
LDFB       Load data from Main Memory into Frame Buffer.
STFB       Store data into Main Memory from Frame Buffer.
CBCAST     Context broadcast, no data from Frame Buffer.
DBCBC      Column context broadcast, get data from both banks of Frame Buffer.
DBCBR      Row context broadcast, get data from both banks of Frame Buffer.
DBCB       Context broadcast, get data from both banks of Frame Buffer.
SBCB       Context broadcast, transfer 128-bit data from Frame Buffer.
WFB        Write the processed data back into Frame Buffer with indirect address.
WFBI       Write the processed data back into Frame Buffer with immediate address.
RCRISC     Write one 16-bit data from RC Array into TinyRISC.
The DMA instructions contain fields that provide the DMA Controller with
adequate information, such as starting address in main memory, starting address in Frame
Buffer or Context Memory, number of bytes to load, load or store control, etc. This
enables the transfer of data between main memory and Frame Buffer or Context Memory
through the DMA Controller.
The RC Array instructions have fields that provide control signals to the RC
Array and Context Memory. This is essential to enable the execution of computations in
the RC Array. This information includes the contexts to be executed, the mode of context
broadcast (row or column), location of data to be loaded in from Frame Buffer, etc.
1.3 Modifications to MorphoSys
In the implementation of M2, some modifications to MorphoSys architecture are
proposed, including memory size expansion and architectural revamping of the RC. The
modifications that have impact on the implementation of AES are briefly mentioned
below.
1.3.1 Size Expansion of Register File and Context Memory
To make RC capable of more complicated algorithms, 8 registers (instead of 4)
will be included in the register file. The size of context memory will be increased to be
able to store 256 context planes instead of 32. These upgrades are critical to the
implementation effectiveness of some complex algorithms, such as AES, FFT, Reed
Solomon Codes, and so on. Specifically, AES uses 7 registers and 27 contexts for
encryption, and 8 registers and 28 contexts for decryption. Notice that the numbers of
contexts mentioned here are only for AES’s data processing part. Besides, its
initialization part needs more than 500 contexts for loading two tables, 256 bytes each.
Since these tables are only loaded once in a session, it is acceptable to repeatedly load
them into a small-size context memory. So a context memory with the capability of
storing 32 contexts is enough for AES. However, the increase of the number of registers
is necessary to achieve high-speed implementation of AES.
1.3.2 Embedded Lookup Table in Every RC
Lookup operation is common in quite a few algorithms. For AES, it is the most
important operation (see Chapter 2). To achieve high computing parallelism, M2 will
embed a 512-byte lookup table in each RC. This table will be implemented by SRAM.
1.3.3 New RC Array Instructions
To access the lookup table in every RC, two new RC Array instructions,
“LDMM” and “STMM”, are added to the instruction set. For example, “LDMM r1 > 5”
means loading the value of table element (memory) at address r1 into register r5; “STMM
r5 > 1” means storing the value of register r5 into the table element (memory) at address
r1.
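The meaning of the two instructions can be pictured with a small software model of one RC. This is a minimal sketch, not the M2 hardware: the class name `RCModel` and its methods are hypothetical, and it assumes 8 registers, a 512-entry table, and one byte per table entry.

```python
class RCModel:
    """Hypothetical software model of one RC with its embedded lookup table."""

    def __init__(self):
        self.reg = [0] * 8      # r0..r7 (M2 register file size)
        self.table = [0] * 512  # embedded lookup table, assumed one byte per entry

    def ldmm(self, addr_reg, dest_reg):
        """Model of 'LDMM r<addr> > <dest>': r<dest> = table[r<addr>]."""
        self.reg[dest_reg] = self.table[self.reg[addr_reg]]

    def stmm(self, src_reg, addr_reg):
        """Model of 'STMM r<src> > <addr>': table[r<addr>] = r<src>."""
        self.table[self.reg[addr_reg]] = self.reg[src_reg] & 0xFF


rc = RCModel()
rc.reg[1] = 0x42   # address
rc.reg[5] = 0x7C   # value to store
rc.stmm(5, 1)      # "STMM r5 > 1"
rc.reg[5] = 0      # clear r5 so the reload is visible
rc.ldmm(1, 5)      # "LDMM r1 > 5"
print(hex(rc.reg[5]))
```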
Chapter 2
The Advanced Encryption Standard (AES)
Advanced Encryption Standard (AES) is the new encryption standard that is
expected to replace the current standards, Data Encryption Standard (DES) and Triple
DES. The National Institute of Standards and Technology (NIST) worked with industry
and the public cryptographic community to develop the AES [7]. A comprehensive overview
of AES and its algorithm is given in this chapter.
2.1 Introduction of the AES
After more than three years’ work, NIST recently announced Rijndael as the AES
algorithm. The development of AES and the nature of Rijndael algorithm are briefly
introduced in this section.
2.1.1 History of the AES Development
The AES development was launched by NIST on January 2, 1997. On August 20,
1998, NIST selected fifteen algorithms as candidates for tests. After comprehensive
analysis and public comments by the global cryptographic community, five algorithms
were selected from them as the AES finalists in August 1999. They were MARS, RC6,
Rijndael, Serpent, and Twofish. Then, after two rounds of further public analysis, NIST
announced on October 2, 2000 that Rijndael had been selected for the AES. Four months
after the announcement, NIST finished a draft Federal Information Processing Standard
(FIPS) for the AES and asked for public review and comment [8]. The comment period
ended on May 29, 2001. According to NIST’s schedule, the formal standard is to be
published by the summer of 2001.
2.1.2 Overview of Rijndael
Rijndael is a symmetric block cipher developed by two Belgian cryptographers,
Joan Daemen and Vincent Rijmen. According to its authors, Rijndael can be pronounced
"Reign Dahl", "Rain Doll", or "Rhine Dahl".
Rijndael applies to data blocks of 128 bits, using cipher keys with lengths of
128, 192, and 256 bits*. Rijndael's combination of security, performance, efficiency, ease
of implementation, and flexibility makes it an appropriate selection for the AES.
Specifically, Rijndael has very good performance in both hardware and software
across a wide range of computing. Its initialization time is short, and its key agility is
good. Rijndael's very low memory requirements make it very well suited for restricted-
space environments, in which it also demonstrates excellent performance. Rijndael's
operations are among the easiest to defend against power and timing attacks [9][10].
Additionally, Rijndael's internal round structure appears to have good potential to benefit
from instruction-level parallelism (ILP). It is this ILP characteristic of Rijndael that
motivates the research into its implementation on the MorphoSys architecture.
For more information about Rijndael, a good starting point is the
website maintained by its authors: http://www.esat.kuleuven.ac.be/~rijmen/rijndael/.
* In fact, Rijndael can handle any combination of key size and block size from 128, 192, and 256 bits. But in the AES, the block size is fixed at 128 bits to be more easily accommodated by many types of block cipher design.
2.1.3 Definition of Terms, Parameters and Functions
The terms, parameters, and functions used by AES are defined in the following
two tables. They conform to the convention used by the draft FIPS.
Table 2.1: Terms and Acronyms Used in AES
Term            Explanation
Block           Sequence of binary bits that comprise the input, output, State, and Round Key. The length of a block is the number of bits it contains. For AES, the block length is 128 bits.
Byte            A group of eight bits that is treated either as a single entity or as an array of 8 individual bits.
Cipher          Series of transformations that converts plaintext to ciphertext using the Cipher Key.
Cipher Key      Secret, cryptographic key that is used by the Key Expansion routine to generate a set of Round Keys; can be pictured as a rectangular array of bytes, having four rows and Nk columns.
Ciphertext      Data output from the Cipher or input to the Inverse Cipher.
Inverse Cipher  Series of transformations that converts ciphertext to plaintext using the Cipher Key.
Key Expansion   Routine used to generate a series of Round Keys from the Cipher Key.
Plaintext       Data input to the Cipher or output from the Inverse Cipher.
Round Key       Round Keys are values derived from the Cipher Key using the Key Expansion routine; they are applied to the State in the Cipher and Inverse Cipher.
State           Intermediate Cipher result that can be pictured as a rectangular array of bytes, having four rows and Nb columns.
S-box           Non-linear substitution table used in several byte substitution transformations and in the Key Expansion routine to perform a one-for-one substitution of a byte value.
Word            A group of 32 bits that is treated either as a single entity or as an array of 4 bytes.
Table 2.2: Parameters and Functions Used in AES
AddRoundKey( )     Transformation in the Cipher and Inverse Cipher in which a Round Key is added to the State using an XOR operation. The length of a Round Key equals the size of the State (128 bits, or 16 bytes).
SubBytes( )        Transformation in the Cipher that processes the State using a non-linear byte substitution table (S-box) that operates on each of the State bytes independently.
ShiftRows( )       Transformation in the Cipher that processes the State by cyclically shifting the last three rows of the State by different offsets.
MixColumns( )      Transformation in the Cipher that takes all of the columns of the State and mixes their data (independently of one another) to produce new columns.
InvSubBytes( )     Transformation in the Inverse Cipher that is the inverse of SubBytes( ).
InvShiftRows( )    Transformation in the Inverse Cipher that is the inverse of ShiftRows( ).
InvMixColumns( )   Transformation in the Inverse Cipher that is the inverse of MixColumns( ).
RotWord( )         Function used in the Key Expansion routine that takes a 4-byte word and performs a cyclic permutation.
SubWord( )         Function used in the Key Expansion routine that takes a 4-byte input word and applies an S-box to each of the 4 bytes to produce an output word.
Nb                 Number of columns (32-bit words) comprising the State. For AES, Nb = 4.
Nk                 Number of 32-bit words comprising the Cipher Key. For AES, Nk = 4, 6, or 8.
Nr                 Number of rounds, which is a function of Nk and Nb (which is fixed). For AES, Nr = 10, 12, or 14.
2.2 Mathematical Background of Rijndael
Before looking into the algorithm of Rijndael, it is helpful to understand the
mathematical basis used by it. In this section, the necessary mathematical concepts are
introduced, and some simple examples are given.
2.2.1 Polynomial Representation of a Finite Field Element
The basic processing unit in Rijndael is a byte, which can be represented as a
group of eight contiguous bits:
$\{b_7, b_6, b_5, b_4, b_3, b_2, b_1, b_0\}$, where $b_i = 0$ or $1$
Furthermore, it can be interpreted as a finite field element using a polynomial
representation [11]:
$b_7 x^7 + b_6 x^6 + b_5 x^5 + b_4 x^4 + b_3 x^3 + b_2 x^2 + b_1 x + b_0$
For example, {10011100} identifies the following specific finite field element:
$x^7 + x^4 + x^3 + x^2$
To simplify the representation, hexadecimal notation is introduced. For example,
the above element {10011100} can be represented as {9C}, or simpler, ‘9C’.
Since the unit in Rijndael is a byte, all elements can be represented by two
hexadecimal digits. This kind of finite field is called GF($2^8$). (GF stands for Galois Field.)
2.2.2 Addition in GF($2^8$)
The sum of two elements is a polynomial with coefficients that are given by
the sum modulo 2 of the corresponding coefficients of the two operands. For example,
‘9C’ + ‘26’ = ‘BA’
Or, with the polynomial representation:
$(x^7 + x^4 + x^3 + x^2) + (x^5 + x^2 + x) = x^7 + x^5 + x^4 + x^3 + x$
Not surprisingly, the addition in GF($2^8$) is actually a simple and fast bitwise XOR
operation. To verify it with the previous example,
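The check on ‘9C’ + ‘26’ can be reproduced with a short Python sketch. The helper name `gf_add` is illustrative only and does not come from the thesis.

```python
def gf_add(a: int, b: int) -> int:
    """Add two GF(2^8) elements: coefficient-wise sum modulo 2, i.e. bitwise XOR."""
    return a ^ b


a, b = 0x9C, 0x26
# Print the operands and the sum as 8-bit strings, one bit per polynomial coefficient.
print(f"{a:08b} XOR {b:08b} = {gf_add(a, b):08b}")  # 10011100 XOR 00100110 = 10111010
print(f"{gf_add(a, b):02X}")                        # BA
```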
* Usually the Cipher Key is not changed in one session of encryption/decryption. But theoretically, one can use several Cipher Keys within one session to achieve better security. In that case, each change of Cipher Key will introduce one Key Expansion.
Recall that there are an initial Round Key addition, several intermediate rounds, and a
final round, so the number of Round Keys equals the number of rounds
plus 1. Because Nr = 10, 12, 14 for Nk = 4, 6, 8, respectively, the numbers of Round
Keys are 11, 13, 15, respectively.
The expansion processes the data at word level. The ith word, or W[i], includes
the (4*i)th, (4*i+1)th, (4*i+2)th, and (4*i+3)th bytes, or the ith column. For example, if Nk =
4, there are 4 words in the Cipher Key, and it is expanded to 11*4 = 44 words, or
44*4*8 = 1408 bits.
The Rcon[ ] array in the code is a constant array listed in Appendix A.
As shown in the code, the first Nk words of the whole expanded Round Keys are
exactly the original Cipher Key. After that, an optimized hardware implementation
should carry out the expansion in a number of loops, because in this way the expanded
Round Keys can be calculated in place, which saves a lot of memory. Please refer to Section
3.2.1 for detailed information.
The result of Key Expansion is a sequence of words that should be partitioned into
(Nr+1) Round Keys. The partition is very simple: from the beginning, every 4 words
form a Round Key. Figure 2.5 shows the Round Key expansion and partition for Nk = 6.
As shown below, W0 to W5 form the original Cipher Key, but every Round Key contains
only 4 words.
W0  W1  W2  W3  | W4  W5  W6  W7  | W8  W9  W10 W11 | W12 W13 …
Round Key 0     | Round Key 1     | Round Key 2     | …
Figure 2.5: Key Expansion and Round Key Partition for Nk = 6
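The partition rule can be sketched in Python as follows. The function names are illustrative, the word labels are placeholders rather than real key material, and the relation Nr = Nk + 6 is the AES rule for Nb = 4.

```python
NB = 4  # number of columns (32-bit words) in the State, fixed for AES


def rounds(nk: int) -> int:
    """Nr for AES with Nb = 4: 10, 12, 14 for Nk = 4, 6, 8."""
    return nk + 6


def partition_round_keys(words, nk):
    """Split the expanded key words into (Nr + 1) Round Keys of Nb words each."""
    needed = NB * (rounds(nk) + 1)
    return [words[i:i + NB] for i in range(0, needed, NB)]


words = [f"W{i}" for i in range(52)]  # placeholder labels for Nk = 6
keys = partition_round_keys(words, 6)
print(len(keys))   # 13
print(keys[1])     # ['W4', 'W5', 'W6', 'W7']
```

Round Key 1 starts at W4 even though the Cipher Key for Nk = 6 runs from W0 to W5, which is the point Figure 2.5 makes.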
2.3.2 The Inverse Cipher
In the Inverse Cipher, each function is substituted by its inverse function, and the
order is reversed. The basic pseudo code is listed below.
KeyExpansion(CipherKey, RoundKey);
state = in;
// inverse of last round
AddRoundKey(state);        // use RoundKey[Nr]
InvShiftRows(state);
InvSubBytes(state);
// inverse of intermediate rounds
for (round = Nr-1; round > 0; round--)
{
    AddRoundKey(state);    // inv addition = addition
    InvMixColumns(state);
    InvShiftRows(state);
    InvSubBytes(state);
}
// inverse initial Round Key Addition
AddRoundKey(state);        // use RoundKey[0]
out = state;
Figure 2.6: Basic Pseudo Code for the Inverse Cipher of Rijndael Algorithm
InvShiftRows( ) is defined as follows: Row 0 is not shifted; Row 1 is shifted over 3 bytes;
Row 2 is shifted over 2 bytes; Row 3 is shifted over 1 byte. So the positions of the bytes
in a State are changed as follows:
 1  5  9 13          1  5  9 13
 2  6 10 14   -->   14  2  6 10
 3  7 11 15         11 15  3  7
 4  8 12 16          8 12 16  4
Figure 2.7: Transformation of InvShiftRows( )
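A minimal Python sketch reproduces Figure 2.7. It assumes that "shifted over n bytes" means a cyclic left shift by n positions, which is consistent with the figure; the function name is illustrative.

```python
def inv_shift_rows(state):
    """Cyclically shift row r left by (4 - r) % 4 positions: offsets 0, 3, 2, 1."""
    result = []
    for r, row in enumerate(state):
        k = (4 - r) % 4
        result.append(row[k:] + row[:k])
    return result


state = [
    [1, 5, 9, 13],
    [2, 6, 10, 14],
    [3, 7, 11, 15],
    [4, 8, 12, 16],
]
for row in inv_shift_rows(state):
    print(row)
```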
InvSubBytes( ) is the byte substitution where the inverse table, inv S-box, is
applied. The inv S-box table is listed in Appendix A.
InvMixColumns( ) is similar to MixColumns( ), but it uses a different c(x), given
by
$c(x) = \text{'0B'}\,x^3 + \text{'0D'}\,x^2 + \text{'09'}\,x + \text{'0E'}$
The coefficients of this polynomial are larger than those of the polynomial used by
MixColumns( ), $\text{'03'}\,x^3 + \text{'01'}\,x^2 + \text{'01'}\,x + \text{'02'}$. So InvMixColumns( ) is slower
due to more xtime and XOR operations (see Section 2.3.1.3).
There are some properties of these inverse functions that can be exploited to
derive a Cipher-like structure for the Inverse Cipher.
First, the order of InvShiftRows( ) and InvSubBytes( ) does not matter. This is
because InvShiftRows( ) only moves bytes to new positions and has no effect on their values, and
InvSubBytes( ) works on individual bytes, independent of their positions.
Second, the sequence
AddRoundKey(State, RoundKey);
InvMixColumns(State);
can be replaced by
InvMixColumns(State);
AddRoundKey(State, InvRoundKey);
where InvRoundKey is obtained by:
1. Apply the Key Expansion.
2. Apply InvMixColumns( ) to all Round Keys except the first one and the last one.
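The second property holds because InvMixColumns( ) is linear with respect to XOR. The Python sketch below checks this on one column. The helper names are illustrative, the reduction constant 0x11B is the standard AES field polynomial, and the column values are arbitrary test data.

```python
def xtime(a):
    """Multiply a GF(2^8) element by '02', reducing by the AES polynomial 0x11B."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def gmul(a, b):
    """Multiply two GF(2^8) elements using only xtime and XOR."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def inv_mix_column(col):
    """Apply InvMixColumns to one 4-byte column (coefficients 0E, 0B, 0D, 09)."""
    m = [0x0E, 0x0B, 0x0D, 0x09]
    out = []
    for i in range(4):
        acc = 0
        for j in range(4):
            acc ^= gmul(m[(j - i) % 4], col[j])
        out.append(acc)
    return out


state_col = [0xDB, 0x13, 0x53, 0x45]  # arbitrary column of the State
key_col = [0x01, 0x02, 0x03, 0x04]    # arbitrary column of a Round Key

# AddRoundKey followed by InvMixColumns ...
lhs = inv_mix_column([s ^ k for s, k in zip(state_col, key_col)])
# ... equals InvMixColumns followed by AddRoundKey with the transformed key.
inv_round_key_col = inv_mix_column(key_col)
rhs = [s ^ k for s, k in zip(inv_mix_column(state_col), inv_round_key_col)]
print(lhs == rhs)
```

The transformed key column plays the role of InvRoundKey in the text.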
Notice that the basic pseudo code in Figure 2.6 can be represented by the
following sequence:
ASB AMSB AMSB … AMSB A
where A means AddRoundKey( ), S means InvShiftRows( ), B means
InvSubBytes( ), and M means InvMixColumns( ).
Using the two properties to change the order SB to BS and AM to MA, the sequence
becomes
ABS MABS MABS … MABS A
or equivalently
A BSMA BSMA … BSMA BSA
The last sequence is exactly the Cipher’s sequence. So, with the use of
InvRoundKey, the Inverse Cipher’s structure is the same as the Cipher’s. When AES is
mapped into MorphoSys, the Inverse Cipher uses exactly the same architecture as the
Cipher. Of course, the functions InvShiftRows( ) and InvMixColumns( ) are slightly
different from ShiftRows( ) and MixColumns( ), and InvRoundKey replaces the
RoundKey.
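The regrouping can be checked mechanically by treating each sequence as a string of the letters A, S, B, and M. This is only a bookkeeping sketch with hypothetical function names; it confirms the letter order, not the cryptographic equivalence.

```python
def inverse_cipher_sequence(nr):
    """Sequence of Figure 2.6: ASB, then (Nr - 1) times AMSB, then A."""
    return "ASB" + "AMSB" * (nr - 1) + "A"


def reorder(seq):
    """Apply the two properties: swap SB to BS, then AM to MA."""
    return seq.replace("SB", "BS").replace("AM", "MA")


def cipher_like_sequence(nr):
    """Cipher structure: A, then (Nr - 1) times BSMA, then BSA."""
    return "A" + "BSMA" * (nr - 1) + "BSA"


for nr in (10, 12, 14):
    print(nr, reorder(inverse_cipher_sequence(nr)) == cipher_like_sequence(nr))
```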
Chapter 3
Mapping AES into MorphoSys
AES has already been widely implemented in different programming languages, such as
C/C++ [13][14], Java [15], Visual Basic [16], Perl [17], Assembly [18], Ada [19], etc. It can
also be implemented in hardware, such as an ASIC. MorphoSys is designed for applications
with inherent data-parallelism, high regularity, and high throughput requirements. Due to
the high data-parallelism in the AES algorithm, MorphoSys is able to run it much
faster than those software implementations. Besides, because of the reconfigurability of
MorphoSys, the mapped AES algorithm can be part of a larger system.
In this chapter, several key features of MorphoSys that help the mapping of AES
are pointed out. Then the complete mapping process, including the Key Expansion by
the TinyRISC processor, the data processing by RC Array, and the Context/data loading and
storing, is discussed. Finally, the simulation and results are introduced and analyzed.
3.1 Parallel Computing Exploration
Rijndael is a block cipher that includes a large number of table-lookup operations
and data movements; the actual ALU operations are just a very small part in terms of
running time or number of instructions. So the main concerns are how to input/output the
blocks between Frame Buffer and RC Array, how to do the table-lookup operations, and how
to move the data among RCs with the help of the three layers of the RC Array
interconnection network.
3.1.1 Multi-block Processing
Every data block in Rijndael has 16 bytes, while the number of RCs in the RC
Array is 64. Because there is no data dependency between any two data blocks,
MorphoSys has the capability to process 4 data blocks at the same time.
Because each block is a 4x4 matrix, it is very natural to partition the 4 blocks as
shown in Figure 3.1. However, because the data is column-wise stored in main memory
and Frame Buffer, this partitioning will introduce data reshuffle, which is very difficult to
realize in the Frame Buffer.
Block 0 (4x4)
Block 1 (4x4)
Block 2 (4x4)
Block 3 (4x4)
Figure 3.1: Intuitive Partitioning of RC Array
The actual partitioning used in the implementation is shown in Figure 3.2.
Block 0 (8x2)
Block 1 (8x2)
Block 2 (8x2)
Block 3 (8x2)
Figure 3.2: Actual Partitioning of RC Array
Under this partitioning, the data loading/storing process is straightforward. But
the data movement for ShiftRows( ) is not the same as in a 4x4 matrix. Please refer to
Section 3.1.3 for details about the data movement.
3.1.2 Parallel Table-lookup
In M2’s architecture, there is an embedded memory for each RC. This memory
behaves as a local lookup table. When a context commands a row/column to perform a
table-lookup operation, eight table-lookups are done in parallel. Furthermore, if the eight
contexts in a whole context plane all indicate table-lookup operations, 64 table-lookups
are done in parallel. On the other hand, in a software implementation of Rijndael, the
table-lookup operation can only be done one by one. That is significantly slower than the
implementation in MorphoSys.
3.1.3 Dedicated Data Movement for Rijndael
Recall the data movement for ShiftRows( ). The new position of every byte is
shown in Figure 3.3.
 1  5  9 13          1  5  9 13
 2  6 10 14   -->    6 10 14  2
 3  7 11 15         11 15  3  7
 4  8 12 16         16  4  8 12
Figure 3.3: Transformation of ShiftRows( ) in 4x4 Matrix
Before moving the data according to ShiftRows( ), one needs to be aware of what
data is needed in the subsequent function MixColumns( ). MixColumns( ) is a “column”
function, which means a byte only needs the values of the four bytes (including
itself) in the same column for the transformation. For example, the highlighted byte at
position 10 needs the values of the bytes in the same column, marked 5, 10, 15, and
4, to do MixColumns( ).
In MorphoSys, a block is partitioned into an 8x2 matrix, and every RC stores a byte.
So ShiftRows( ) will move the data as follows.
 1  9          1  9
 2 10          6 14
 3 11         11  3
 4 12   -->   16  8
 5 13          5 13
 6 14         10  2
 7 15         15  7
 8 16          4 12
Figure 3.4: Transformation of ShiftRows( ) in 8x2 Matrix
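The 8x2 layout can be derived from the 4x4 ShiftRows( ) with a short Python sketch. The function names are illustrative. The layout rule is an assumption read off Figure 3.4: State columns 0 and 1 are stacked in RC column 0, and State columns 2 and 3 in RC column 1.

```python
def shift_rows(state):
    """Cyclically shift row r of a 4x4 State left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def to_8x2(state):
    """Lay a 4x4 State out on an 8x2 patch of RCs (assumed stacking of columns)."""
    cols = [[state[r][c] for r in range(4)] for c in range(4)]
    left = cols[0] + cols[1]    # RC column 0
    right = cols[2] + cols[3]   # RC column 1
    return [[left[i], right[i]] for i in range(8)]


state = [
    [1, 5, 9, 13],
    [2, 6, 10, 14],
    [3, 7, 11, 15],
    [4, 8, 12, 16],
]
before = to_8x2(state)
after = to_8x2(shift_rows(state))
for b, a in zip(before, after):
    print(b, "-->", a)
```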
To make every RC do MixColumns( ) independently and simultaneously, it is
desirable to have every RC store four relevant values used by MixColumns( ) into its
local registers. For example, because RC(5,0)* will use the input value in RC(4,0),
RC(5,0), RC(6,0), and RC(7,0) for MixColumns( ), it should store them into its local
registers.
Figure 3.5 shows the data movement result for ShiftRows( ). After the move, each
RC will contain the shifted data as well as the relevant data for MixColumns( ). Notice
that only two columns of RCs are shown here. The other six columns of RCs (the other three
blocks) apply the same move. The order of the bytes saved in the four registers is not
important. As shown later, the order is not exactly the same as in Figure 3.5; it merely
depends on the ease of implementation.
* Assume we only consider block 0 here. The corresponding RCs in the other three blocks are RC(5,2), RC(5,4), and RC(5,6).
         Before the move                    After the move
         Column 0      Column 1             Column 0      Column 1
         r0 r1 r2 r3   r0 r1 r2 r3          r0 r1 r2 r3   r0 r1 r2 r3
Row 0     1  –  –  –    9  –  –  –           1  6 11 16    9 14  3  8
Row 1     2  –  –  –   10  –  –  –           1  6 11 16    9 14  3  8
Row 2     3  –  –  –   11  –  –  –           1  6 11 16    9 14  3  8
Row 3     4  –  –  –   12  –  –  –           1  6 11 16    9 14  3  8
Row 4     5  –  –  –   13  –  –  –           5 10 15  4   13  2  7 12
Row 5     6  –  –  –   14  –  –  –           5 10 15  4   13  2  7 12
Row 6     7  –  –  –   15  –  –  –           5 10 15  4   13  2  7 12
Row 7     8  –  –  –   16  –  –  –           5 10 15  4   13  2  7 12
Figure 3.5: Data Movement for ShiftRows( )
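The register contents on the right side of Figure 3.5 follow from one rule: after the move, every RC holds the whole shifted State column that its own byte belongs to. A hedged Python sketch of that rule for block 0 is given below; the names are illustrative, and the register order simply follows the figure.

```python
def shift_rows(state):
    """Cyclically shift row r of a 4x4 State left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def registers_after_move(state):
    """Map each RC (rc_row, rc_col) of block 0 to its register values [r0, r1, r2, r3]."""
    shifted = shift_rows(state)
    cols = [[shifted[r][c] for r in range(4)] for c in range(4)]
    regs = {}
    for rc_row in range(8):
        for rc_col in range(2):
            # RC column 0 holds State columns 0 and 1, RC column 1 holds 2 and 3.
            regs[(rc_row, rc_col)] = cols[2 * rc_col + rc_row // 4]
    return regs


state = [
    [1, 5, 9, 13],
    [2, 6, 10, 14],
    [3, 7, 11, 15],
    [4, 8, 12, 16],
]
regs = registers_after_move(state)
print(regs[(5, 0)])  # [5, 10, 15, 4]
print(regs[(0, 1)])  # [9, 14, 3, 8]
```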
The detailed data movement illustration and algorithm for encryption/decryption
are discussed in Section 3.2.4.
3.2 Algorithm Flowchart and Illustration
The whole algorithm can be divided into two parts: sequential part and parallel
part. The sequential part includes Key Expansion, and is done by TinyRISC. The parallel
part includes loading lookup tables, loading Round Keys, loading data, processing data,
and storing data. It is done by RC Array.
The complete flowchart is shown in Figure 3.6. And the implementation of each
block is discussed in the following sections.
1. Key Expansion by TinyRISC: store the result (Round Keys) into main memory.
2. Table Loading: load the xtime and S-box (or inv S-box) tables into every RC.
3. Data and Round Key Loading: load four data blocks and the currently needed Round Key from main memory to Frame Buffer, then to RC Array.
4. Data Encryption/Decryption: perform the multiple-round cipher or inverse cipher in RC Array.
5. Data Storing: store four data blocks from RC Array to Frame Buffer, then to main memory.
6. End of Data? No: process the next four data blocks. Yes: End.
Figure 3.6: Flowchart of Rijndael Implementation in MorphoSys
3.2.1 Key Expansion by TinyRISC
The pseudo code for Key Expansion has been discussed in Section 2.3.1.5. In
order to reduce the number of registers used in TinyRISC, the assembly code uses a loop
structure: Nk words are generated in each loop, until the total number reaches the desired
number (of words). For example, if Nk = 4, the total number of words in all Round Keys
is 4*11 = 44, so the total number of loops is $\lceil 44/4 \rceil = 11$; if Nk = 6, the total number is
4*13 = 52, so the number of loops is $\lceil 52/6 \rceil = 9$; if Nk = 8, the total number is 4*15 =
60, so the number of loops is $\lceil 60/8 \rceil = 8$. The indivisibility when Nk ≠ 4 means that more
words than necessary are generated during the expansion. The extra words can
simply be discarded.
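The loop counts and the number of discarded words can be reproduced with a few lines of Python. The function name is illustrative, and Nr = Nk + 6 is the AES rule for Nb = 4.

```python
import math


def expansion_loops(nk):
    """Return (total words needed, loops of Nk words each, extra words discarded)."""
    nr = nk + 6
    total_words = 4 * (nr + 1)
    loops = math.ceil(total_words / nk)
    extra = loops * nk - total_words
    return total_words, loops, extra


for nk in (4, 6, 8):
    print(nk, expansion_loops(nk))
# 4 (44, 11, 0)
# 6 (52, 9, 2)
# 8 (60, 8, 4)
```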
In the Inverse Cipher, an additional InvMixColumns( ) function is applied to
every Round Key except the first and last one.
Because the main memory and TinyRISC are 32-bit, the expanded Round Keys
are also stored in 32-bit format. But this format cannot be used by Frame Buffer, which expects 16-bit
inputs. For example, when Frame Buffer reads a 32-bit word 0x00000064 from main
memory, it treats it as two numbers: 0x0000 and 0x0064. So the result needs 2-to-1
concatenations: after all Round Keys are generated and stored back into main memory in
32-bit format, they are loaded into TinyRISC again, two 32-bit words at a time,
concatenated into one 32-bit word, and then stored back into main memory.
0x000000eb, 0x0000003d   -->   0x00eb003d
...                      -->   (next concatenation)
Figure 3.7: Concatenations of Round Keys
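The 2-to-1 concatenation amounts to a shift and an OR. A minimal Python sketch with a hypothetical helper name follows; it assumes the useful data sits in the low 16 bits of each input word, as in the example of Figure 3.7.

```python
def concat_pair(first_word, second_word):
    """Pack the low 16 bits of two 32-bit words into one 32-bit word."""
    return ((first_word & 0xFFFF) << 16) | (second_word & 0xFFFF)


packed = concat_pair(0x000000EB, 0x0000003D)
print(f"0x{packed:08x}")  # 0x00eb003d
```

After packing, Frame Buffer reads the word as the two 16-bit numbers 0x00eb and 0x003d.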
3.2.2 Table Loading
Three types of contexts are needed for loading each table element. They are:
set 0, 0 LDIM! 5 def def > 0; # load value 5 into RC’s r0
set 0, 15 STMM r0 def > 1; # store r0 into table address r1
set 8, 15 ADD r1 r2 > 1; # increase r1 by 1 (r2)
Notice that once STMM and ADD* are loaded into Context Memory, they can be
used for every table element. So theoretically, the total number of contexts to load two
initial value) = 516. But the size of Context Memory is not big enough to save all 516
contexts. In M2, the Context Memory can save up to 256 contexts. Since STMM, ADD,
and initialization contexts are needed once in every 256 contexts, the pattern of contexts
should be:
1st loading: 252 LDIMs + 1 STMM + 1 ADD + 2 Initialization
2nd loading: 252 LDIMs + 1 STMM + 1 ADD + 2 Initialization
3rd loading: 8 LDIMs + 1 STMM + 1 ADD + 2 Initialization
The total number of contexts is 256*2 + 12 = 524.
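The loading pattern for a 256-context memory can be recomputed with a short Python sketch. The names are illustrative, and the overhead of four contexts per loading (1 STMM, 1 ADD, 2 initialization contexts) is taken from the pattern above.

```python
TABLE_ENTRIES = 2 * 256  # two tables of 256 bytes, one LDIM context per entry
OVERHEAD = 1 + 1 + 2     # STMM + ADD + 2 initialization contexts per loading


def loading_plan(capacity):
    """Return the number of LDIM contexts placed in each Context Memory loading."""
    per_loading = capacity - OVERHEAD
    plan = []
    remaining = TABLE_ENTRIES
    while remaining > 0:
        n = min(per_loading, remaining)
        plan.append(n)
        remaining -= n
    return plan


plan = loading_plan(256)
total_contexts = sum(plan) + OVERHEAD * len(plan)
print(plan)            # [252, 252, 8]
print(total_contexts)  # 524
```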
At the time the author simulated the implementation of AES, the simulator was
only able to handle up to 32 contexts (i.e., M1’s structure). So there are 18 table
loadings instead of 3. In any case this is not a big issue, because the table loading is done only
once during the initialization.
There are two tables to be loaded: xtime and S-box (or inv S-box). One of them is
from address 0x00 to 0xFF, and another is from address 0x100 to 0x1FF. As shown later,
to access the second table, an extra context to add the offset 0x100 is needed for every
* All RC Array instructions are listed in Appendix C.
44
table lookup operation. Because xtime table is used more frequently (see next section), it
is reasonable to load it first.
3.2.3 Data and Round Key Loading
Four blocks, or 64 bytes of data, and the currently needed Round Key (16 bytes)
are loaded from main memory into Frame Buffer, then into RC Array. Because the four
blocks use the same Round Key, the Round Key is loaded from Frame Buffer into RC
Array four times. The instructions involved are LDFB and SBCB.
3.2.4 Data Processing in RC Array
After the data and Round Key have been loaded into RC Array, the next step is
to process the data in RC Array. As stated in Chapter 2, the process includes four
functions: SubBytes( ), ShiftRows( ), MixColumns( ), and AddRoundKey( ).
The contexts for SubBytes( ) are very simple:
set 0, 3 ADD r0 r1 > 0; # r1 is constant 0x0100
set 0, 4 LDMM r0 def > 0; # load into r0
The first context adds the offset 0x100 to the index register r0. The second context
loads the table element at the resulting address (the original r0 plus 0x100) into r0. So
the result is r0 = S-box(r0) (or inv S-box(r0)).
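The two-context lookup can be modeled in software as follows. This is a minimal sketch under stated assumptions: the table memory is taken to hold xtime at 0x000-0x0FF and the S-box at 0x100-0x1FF, as described in Section 3.2.2, and the S-box is generated from its standard definition (multiplicative inverse in GF(2^8) followed by the affine transform) rather than copied from the thesis. The names TABLE_MEM and sub_byte are hypothetical.

```python
def xtime(b):
    """Multiply a byte by 0x02 in GF(2^8), reducing modulo 0x11B."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b


def gmul(a, b):
    """Multiply two bytes in GF(2^8)."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a = xtime(a)
        b >>= 1
    return product


def rotl8(x, n):
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox_entry(x):
    """AES S-box value: field inverse (0 maps to 0), then the affine transform."""
    inv = 0 if x == 0 else next(y for y in range(1, 256) if gmul(x, y) == 1)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


# Assumed table memory image: xtime first, S-box at offset 0x100.
TABLE_MEM = [xtime(i) for i in range(256)] + [sbox_entry(i) for i in range(256)]


def sub_byte(r0):
    r1 = 0x0100           # constant kept in r1
    r0 = r0 + r1          # models: ADD r0 r1 > 0
    r0 = TABLE_MEM[r0]    # models: LDMM r0 def > 0
    return r0


print(hex(sub_byte(0x53)))   # 0xed
```

An xtime lookup needs only the LDMM step, because that table starts at address 0.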
The context for AddRoundKey( ) is also very simple:
set 8, 0 XOR r0 r7 > 0; # RoundKey is saved in r7
However, the contexts for ShiftRows( ) and MixColumns( ) are more complicated.
ShiftRows( ) includes eight steps of data movement, and MixColumns( ) mainly consists
of xtime and XOR operations.
The data movement and contexts for ShiftRows( ) are illustrated in several
figures.
Column 0 Column 1 Column 0 Column 1
r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3
Row 0 1 – – – 9 – – – 1 – – – 9 – – –
Row 1 2 – – – 10 – – – 2 – – – 10 – – –
Row 2 3 – – – 11 – – – 3 – – – 11 – – –
Row 3 4 – – – 12 – – – 4 5 – – 12 13 – –
Row 4 5 – – – 13 – – – 5 – – – 13 – – –
Row 5 6 – – – 14 – – – 6 1 – – 14 9 – –
Row 6 7 – – – 15 – – – 7 – – – 15 – – –
Row 7 8 – – – 16 – – – 8 – – – 16 – – –
$r_0^0 \to r_1^5,\quad r_0^4 \to r_1^3$ (Express Lane, Row Mode)
set 8 , 1 BYPASS r0 def > 0 WE ;
set 9 , 1 BYPASS r0 def > 0 ;
set 10 , 1 BYPASS r0 def > 0 ;
set 11 , 1 BYPASS VE def > 1 ;
set 12 , 1 BYPASS r0 def > 0 WE ;
set 13 , 1 BYPASS VE def > 1 ;
set 14 , 1 BYPASS r0 def > 0 ;
set 15 , 1 BYPASS r0 def > 0 ;
Figure 3.8: ShiftRows( ) Step 1
Figure 3.8 shows the first step of ShiftRows( ). The contexts are in Row Mode,
which means one context for one row. Rows 0 and 4 put the data in r0 onto the Express
Lane, and Rows 3 and 5 get the data from the corresponding vertical Express Lane and
save it into register r1. By this means, the values at positions 1 and 5 are transferred to
the desired positions. In this step, Rows 1, 2, 6, and 7 perform NOP operations.
($r_k^i$ means $r_k$ in Row $i$.)
Column 0 Column 1 Column 0 Column 1
r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3
Row 0 1 – – – 9 – – – 1 – – – 9 – – –
Row 1 2 – – – 10 – – – 2 7 – – 10 15 – –
Row 2 3 – – – 11 – – – 3 – – – 11 – – –
Row 3 4 5 – – 12 13 – – 4 5 – – 12 13 – –
Row 4 5 – – – 13 – – – 5 – – – 13 – – –
Row 5 6 1 – – 14 9 – – 6 1 – – 14 9 – –
Row 6 7 – – – 15 – – – 7 – – – 15 – – –
Row 7 8 – – – 16 – – – 8 3 – – 16 11 – –
$r_0^2 \to r_1^7,\quad r_0^6 \to r_1^1$ (Express Lane, Row Mode)
set 8 , 2 BYPASS r0 def > 0 ;
set 9 , 2 BYPASS VE def > 1 ;
set 10 , 2 BYPASS r0 def > 0 WE ;
set 11 , 2 BYPASS r0 def > 0 ;
set 12 , 2 BYPASS r0 def > 0 ;
set 13 , 2 BYPASS r0 def > 0 ;
set 14 , 2 BYPASS r0 def > 0 WE ;
set 15 , 2 BYPASS VE def > 1 ;
Figure 3.9: ShiftRows( ) Step 2
Figure 3.9 shows the second step. It is similar to the first step, but moves different
data into desired positions.
($r_k^i$ means $r_k$ in Row $i$.)
Column 0 Column 1 Column 0 Column 1
r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3
Row 0 1 – – – 9 – – – 1 – – – 9 – – –
Row 1 2 7 – – 10 15 – – 2 7 10 15 10 15 2 7
Row 2 3 – – – 11 – – – 3 – – – 11 – – –
Row 3 4 5 – – 12 13 – – 4 5 – – 12 13 – –
Row 4 5 – – – 13 – – – 5 – – – 13 – – –
Row 5 6 1 – – 14 9 – – 6 1 – – 14 9 – –
Row 6 7 – – – 15 – – – 7 – – – 15 – – –
Row 7 8 3 – – 16 11 – – 8 3 16 11 16 11 8 3
$r_0^1 \to r_2^0,\quad r_0^0 \to r_2^1 \;|\; r_1^1 \to r_3^0,\quad r_1^0 \to r_3^1$ (Left/Right, Column Mode)
set 0 , 7 BYPASS L def > 2 ;
set 1 , 7 BYPASS L def > 2 ;
set 2 , 7 BYPASS R def > 2 ;
set 3 , 7 BYPASS R def > 2 ;
set 4 , 7 BYPASS L def > 2 ;
set 5 , 7 BYPASS L def > 2 ;
set 6 , 7 BYPASS R def > 2 ;
set 7 , 7 BYPASS R def > 2 ;
Figure 3.10: ShiftRows( ) Step 3, 4
The third and fourth steps use Column Mode. Column 2i gets data from
Column 2i+1 (i = 0, 1, 2, 3), and vice versa. Only one context plane is shown in Figure
3.10; the others are similar.
($r_k^i$ means $r_k$ in Column $i$.)
Column 0 Column 1 Column 0 Column 1
r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3
Row 0 1 – – – 9 – – – 6 – – – 14 – – –
Row 1 2 7 10 15 10 15 2 7 6 7 10 15 14 15 2 7
Row 2 3 – – – 11 – – – 6 – – – 14 – – –
Row 3 4 5 – – 12 13 – – 6 5 – – 14 13 – –
Row 4 5 – – – 13 – – – 4 – – – 12 – – –
Row 5 6 1 – – 14 9 – – 4 1 – – 12 9 – –
Row 6 7 – – – 15 – – – 4 – – – 12 – – –
Row 7 8 3 16 11 16 11 8 3 4 3 16 11 12 11 8 3
$r_0^3 \to r_0^{4,5,6,7},\quad r_0^5 \to r_0^{0,1,2,3}$ (Express Lane, Row Mode)
set 8 , 5 BYPASS VE def > 0 ;
set 9 , 5 BYPASS VE def > 0 ;
set 10 , 5 BYPASS VE def > 0 ;
set 11 , 5 BYPASS VE def > 0 WE ;
set 12 , 5 BYPASS VE def > 0 ;
set 13 , 5 BYPASS VE def > 0 WE ;
set 14 , 5 BYPASS VE def > 0 ;
set 15 , 5 BYPASS VE def > 0 ;
Figure 3.11: ShiftRows( ) Step 5
After four steps, all the seed data used for ShiftRows( ) and MixColumns( ) are
ready. Those seeds are highlighted in the left table in Figure 3.11. Then, the Express
Lanes are exploited again to store one byte into four other RCs at the same time. Here in
Step 5, the seeds in register r0 of RC(3, i) and RC(5, i) are propagated through the
Express Lane and fetched by register r0 of RC(4-7, i) and RC(0-3, i), respectively.
($r_k^i$ means $r_k$ in Row $i$.)
Column 0 Column 1 Column 0 Column 1
r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3
Row 0 6 – – – 14 – – – 6 1 16 11 14 9 8 3
Row 1 6 7 10 15 14 15 2 7 6 1 16 11 14 9 8 3
Row 2 6 – – – 14 – – – 6 1 16 11 14 9 8 3
Row 3 6 5 – – 14 13 – – 6 1 16 11 14 9 8 3
Row 4 4 – – – 12 – – – 4 5 10 15 12 13 2 7
Row 5 4 1 – – 12 9 – – 4 5 10 15 12 13 2 7
Row 6 4 – – – 12 – – – 4 5 10 15 12 13 2 7
Row 7 4 3 16 11 12 11 8 3 4 5 10 15 12 13 2 7
Step 6: $r_1^3 \to r_1^{4,5,6,7},\quad r_1^5 \to r_1^{0,1,2,3}$ (Express Lane, Row Mode)
Step 7: $r_2^1 \to r_2^{4,5,6,7},\quad r_2^7 \to r_2^{0,1,2,3}$ (Express Lane, Row Mode)
Step 8: $r_3^1 \to r_3^{4,5,6,7},\quad r_3^7 \to r_3^{0,1,2,3}$ (Express Lane, Row Mode)
set 8 , 7 BYPASS VE def > 1 ;
set 9 , 7 BYPASS VE def > 1 ;
set 10 , 7 BYPASS VE def > 1 ;
set 11 , 7 BYPASS VE def > 1 WE ;
set 12 , 7 BYPASS VE def > 1 ;
set 13 , 7 BYPASS VE def > 1 WE ;
set 14 , 7 BYPASS VE def > 1 ;
set 15 , 7 BYPASS VE def > 1 ;
Figure 3.12: ShiftRows( ) Step 6, 7, 8
Steps 6, 7, and 8 are similar to Step 5. They store the data from the Express Lane
into registers r1, r2, and r3, respectively. Only one context plane is shown above; the
others are similar. After these eight steps, every RC contains the data for MixColumns( ).
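The eight data movements can be replayed in software to check the final register contents against the right-hand table of Figure 3.12. The following Python sketch models only the two columns shown in the figures. It assumes that all RCs read their sources before any register is written within one step, and that the Column Mode steps apply to every row. The function and helper names are hypothetical.

```python
def simulate_shift_rows():
    """Replay ShiftRows( ) Steps 1-8 on an 8-row, 2-column slice of the RC Array."""
    # regs[row][col][k] stands for register r_k of the RC at (row, col).
    regs = [[[None] * 4 for _ in range(2)] for _ in range(8)]
    for row in range(8):
        regs[row][0][0] = row + 1   # bytes 1..8 start in r0 of Column 0
        regs[row][1][0] = row + 9   # bytes 9..16 start in r0 of Column 1

    def express(transfers):
        # Each transfer is (source row, source reg, target rows, target reg).
        # All sources are read before any register is written.
        for col in range(2):
            values = [regs[src][col][k] for src, k, _, _ in transfers]
            for value, (_, _, targets, k) in zip(values, transfers):
                for row in targets:
                    regs[row][col][k] = value

    def exchange(src_k, dst_k):
        # Column Mode: each RC copies r_src_k of its left/right neighbour.
        for row in range(8):
            left, right = regs[row][0][src_k], regs[row][1][src_k]
            regs[row][0][dst_k], regs[row][1][dst_k] = right, left

    upper, lower = range(0, 4), range(4, 8)
    express([(0, 0, [5], 1), (4, 0, [3], 1)])        # Step 1
    express([(2, 0, [7], 1), (6, 0, [1], 1)])        # Step 2
    exchange(0, 2)                                   # Step 3
    exchange(1, 3)                                   # Step 4
    express([(3, 0, lower, 0), (5, 0, upper, 0)])    # Step 5
    express([(3, 1, lower, 1), (5, 1, upper, 1)])    # Step 6
    express([(1, 2, lower, 2), (7, 2, upper, 2)])    # Step 7
    express([(1, 3, lower, 3), (7, 3, upper, 3)])    # Step 8
    return regs


regs = simulate_shift_rows()
print(regs[0][0], regs[4][0])   # [6, 1, 16, 11] [4, 5, 10, 15]
```

With bytes numbered 1 to 16 column by column, {1, 6, 11, 16} and {5, 10, 15, 4} are the first two state columns after ShiftRows( ), so each group of four RCs ends up holding one complete shifted column.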
($r_k^i$ means $r_k$ in Row $i$.)
The algorithm for MixColumns( ) is listed below again for convenience.
tmp = a0 ^ a1 ^ a2 ^ a3;
tm = a0 ^ a1; tm = xtime(tm); a0 ^= tm ^ tmp;
tm = a1 ^ a2; tm = xtime(tm); a1 ^= tm ^ tmp;
tm = a2 ^ a3; tm = xtime(tm); a2 ^= tm ^ tmp;
tm = a3 ^ a0; tm = xtime(tm); a3 ^= tm ^ tmp;
The distinct contexts for them are just “XOR” and “LDMM”. For example:
set 0 , 8 XOR r0 r4 > 4 ;
set 0 , 11 LDMM r5 def > 5 ;
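As a cross-check, the MixColumns( ) listing can be run in software on one state column. The Python sketch below computes xtime arithmetically instead of by table lookup. One detail matters when the four lines are executed sequentially on one processor: the last line must use the original a0, not the value already updated by the first line, so the sketch saves it (the variable orig_a0 is an addition for this illustration).

```python
def xtime(b):
    """Multiply a byte by 0x02 in GF(2^8), reducing modulo 0x11B."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b


def mix_column(a0, a1, a2, a3):
    """Apply MixColumns( ) to one column using only XOR and xtime."""
    orig_a0 = a0                      # the last step needs the unmodified a0
    tmp = a0 ^ a1 ^ a2 ^ a3
    a0 ^= xtime(a0 ^ a1) ^ tmp
    a1 ^= xtime(a1 ^ a2) ^ tmp
    a2 ^= xtime(a2 ^ a3) ^ tmp
    a3 ^= xtime(a3 ^ orig_a0) ^ tmp
    return a0, a1, a2, a3


column = mix_column(0xDB, 0x13, 0x53, 0x45)
print([hex(b) for b in column])   # ['0x8e', '0x4d', '0xa1', '0xbc']
```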
So far all the functions for the Cipher have been discussed. After optimization, the
data processing part of the Cipher uses only 27 contexts in total.
In the Inverse Cipher, the contexts for SubBytes( ) and AddRoundKey( ) are the
same as in the Cipher, but InvShiftRows( ) and InvMixColumns( ) are slightly different.
InvShiftRows( ) also takes eight steps of data movement; the only difference is the
position of the target data. Figure 3.13 shows the first four steps of InvShiftRows( ).
Column 0 Column 1 Column 0 Column 1
r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3
Row 0 1 – – – 9 – – – 1 – – – 9 – – –
Row 1 2 – – – 10 – – – 2 5 – – 10 13 – –
Row 2 3 – – – 11 – – – 3 – – – 11 – – –
Row 3 4 – – – 12 – – – 4 7 12 15 12 15 4 7
Row 4 5 – – – 13 – – – 5 – – – 13 – – –
Row 5 6 – – – 14 – – – 6 3 14 11 14 11 6 3
Row 6 7 – – – 15 – – – 7 – – – 15 – – –
Row 7 8 – – – 16 – – – 8 1 – – 16 9 – –
Figure 3.13: InvShiftRows( ) Step 1, 2, 3, 4
Figure 3.14 shows the next four steps. The highlighted bytes in left table are
seeds. They are propagated to four RCs through the Express Lanes.
Column 0 Column 1 Column 0 Column 1
r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3 r0 r1 r2 r3
Row 0 1 – – – 9 – – – 8 1 14 11 16 9 6 3
Row 1 2 5 – – 10 13 – – 8 1 14 11 16 9 6 3
Row 2 3 – – – 11 – – – 8 1 14 11 16 9 6 3
Row 3 4 7 12 15 12 15 4 7 8 1 14 11 16 9 6 3
Row 4 5 – – – 13 – – – 2 5 12 15 10 13 4 7
Row 5 6 3 14 11 14 11 6 3 2 5 12 15 10 13 4 7
Row 6 7 – – – 15 – – – 2 5 12 15 10 13 4 7
Row 7 8 1 – – 16 9 – – 2 5 12 15 10 13 4 7
Figure 3.14: InvShiftRows( ) Step 5, 6, 7, 8
The algorithm for InvMixColumns( ) is listed below. Because it needs more xtime
and XOR operations, the running time increases slightly. However, with very careful
arrangement of registers and table lookups, the total number of contexts for the data
processing part of decryption increases by only 1, to 28.
tm1 = a0 ^ a1                       // r5 for tm1, ri for ai (i = 0, 1, 2, 3)
tmp1 = tm1 ^ a2                     // r6 for tmp1, get r5 before it is destroyed
tm1 = xtime(tm1)                    // r5 for tm1, needs one lookup context C0
r4 = a0 ^ tm1                       // r5 is free and can be used again
tm2 = a0 ^ a2                       // r5 for tm2
tm2 = xtime(xtime(tm2))             // all use the same context C0 as before
r4 = r4 ^ tm2                       // r5 is free again
tmp2 = tmp1 ^ a3                    // r5 for tmp2, switch back to r5
r4 = r4 ^ tmp2                      // tmp2 = a0 ^ a1 ^ a2 ^ a3 here
tmp2 = xtime(xtime(xtime(tmp2)))    // all use context C0
r4 = r4 ^ tmp2                      // r4 saves the result of InvMixColumns( )
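Expanding the sequence above gives r4 = 0x0e·a0 ^ 0x0b·a1 ^ 0x0d·a2 ^ 0x09·a3, which is one output byte of InvMixColumns( ). The Python sketch below replays the same sequence and compares it with a direct GF(2^8) multiplication. It assumes that the other three output bytes are obtained by rotating the inputs; that rotation is this sketch's own arrangement for testing, not necessarily how the contexts are laid out in RC Array.

```python
def xtime(b):
    """Multiply a byte by 0x02 in GF(2^8), reducing modulo 0x11B."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b


def gmul(a, b):
    """Multiply two bytes in GF(2^8); used only as a reference."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a = xtime(a)
        b >>= 1
    return product


def inv_mix_byte(a0, a1, a2, a3):
    """Follow the register sequence of the listing for one output byte."""
    tm1 = a0 ^ a1
    tmp1 = tm1 ^ a2
    tm1 = xtime(tm1)
    r4 = a0 ^ tm1
    tm2 = xtime(xtime(a0 ^ a2))
    r4 ^= tm2
    tmp2 = tmp1 ^ a3
    r4 ^= tmp2
    tmp2 = xtime(xtime(xtime(tmp2)))
    r4 ^= tmp2
    return r4


def inv_mix_column(col):
    """Output byte i uses the inputs rotated left by i positions."""
    return [inv_mix_byte(*(col[i:] + col[:i])) for i in range(4)]


print([hex(b) for b in inv_mix_column([0x8E, 0x4D, 0xA1, 0xBC])])
# ['0xdb', '0x13', '0x53', '0x45']
```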
3.2.5 Data Storing
After four data blocks are processed in RC Array, they are stored into Frame
Buffer, and then into main memory. The involved instructions are WFBI and STFB. If
there are more data to be encrypted/decrypted, the program will continue to process next
four blocks with the same procedure, until reaching the end of data.
The result saved in the main memory has the concatenated format. For example, a
32-bit word “0x00010002” means two bytes: “0x01” and “0x02” . To comply with the
same format as input*, which uses 32 bits to represent a byte, the result needs to be
separated. Using the same example, “0x00010002” will be separated as “0x00000001”
and “0x00000002”. This separation is performed after all the data have been
encrypted/decrypted.
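The separation is the inverse of the Round Key concatenation. A minimal Python sketch follows; the function name separate_word is hypothetical.

```python
def separate_word(word):
    """Split one concatenated 32-bit word into two 32-bit words, one value each."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF


first, second = separate_word(0x00010002)
print(f"0x{first:08x} 0x{second:08x}")   # 0x00000001 0x00000002
```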
3.3 Simulation Environment
The MorphoSys group has developed a set of software to facilitate the algorithm
mapping, source code compilation, and algorithm simulation for M1. The complete set
of software includes Tcc, TRASM, MorphoSim, mView, mLoad, mSched, and
mULATE, as shown in Figure 3.15. Tcc is a C/C++ compiler that generates the
TinyRISC executable code. TRASM is an assembler that also generates the
TinyRISC executable code. MorphoSim is a VHDL simulator, which exactly matches the
MorphoSys chip. mLoad, mView, and mSched are used for context generation and
application scheduling. mULATE is a cycle-accurate simulator, which is more abstract
than MorphoSim.
* This consistency might be unnecessary. It depends on the specific application.
[Figure: tool flow diagram. Labels: App. (C or Assembly Code); TinyRISC; RC Array; Tcc or TRASM; Executable; RC Array functions; mLoad; Context Lib.; mSched; mView; Configuration context; mULATE, MorphoSim (C++, VHDL); MorphoSys Chip]
Figure 3.15: Software Tools for MorphoSys
To be compatible with the modifications in M2, all of these tools need to be
updated. Up to now, the mLoad*, mULATE†, and TRASM‡ have been updated. So the
author wrote and compiled the TinyRISC assembly code and contexts of the whole
algorithm, then used mULATE to simulate it.
3.4 Performance Analysis
A comprehensive simulation of the encryption and decryption under different
Key sizes was performed in mULATE, and the results are compared with those of
implementations in assembly language, C/C++, Java, and ASIC/Programmable Logic cores.
* mLoad is the context compiler written in Perl. It was updated by the author.
† mULATE was updated by Afshin Niktash.
‡ TRASM was updated by Afshin Niktash.
For the initialization part, other implementations may only need the Key
Expansion. However, for the MorphoSys implementation, it needs the Key Expansion,
lookup table loading, and context loading. Table 3.1 shows the numbers of cycles for the
Key Expansion implemented by ANSI C, C++, and MorphoSys TinyRISC*. In
MorphoSys implementation, the Key Expansion for the Inverse Cipher is much slower
because the InvMixColumns( ) operation is applied to each Round Key except the first
and last one, and the InvMixColumns( ) involves a lot of memory operations which need
a lot of cycles.
Table 3.1: # of Cycles for Key Expansion in Several Implementations
Key Size   AES CD (ANSI C)   Brian Gladman (VC++)   MorphoSys TinyRISC
The numbers of cycles for all three parts of the initialization in the MorphoSys
implementation are listed in Table 3.2. It shows that the Cipher and Inverse Cipher may
need up to 10675 and 25671 cycles for the whole initialization, respectively. Assuming
M2 runs at 200 MHz, this takes about 53 µs and 128 µs, respectively. This time is very
short and acceptable.
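The conversion from cycle counts to time can be written out as a short Python sketch. The 200 MHz clock is the assumption stated above; the function name cycles_to_us is hypothetical.

```python
def cycles_to_us(cycles, clock_hz=200e6):
    """Convert a cycle count to microseconds at the given clock frequency."""
    return cycles / clock_hz * 1e6


for label, cycles in (("Cipher", 10675), ("Inverse Cipher", 25671)):
    print(f"{label}: {cycles_to_us(cycles):.1f} us")
```

For 10675 and 25671 cycles this gives roughly 53.4 µs and 128.4 µs.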
* The statistics for ANSI C and C++ are obtained from the AES proposal by Rijndael's authors.
Table 3.2: # of Cycles for AES Initialization in MorphoSys Implementation
Key Size   Key Expansion   Table Loading   Context Loading   Total # of cycles
128        2770/13320      6249            230/238           9249/19807
192        3386/16029      6249            230/238           9865/22516
256        4196/19184      6249            230/238           10675/25671
* in "x/y", "x" for encryption, "y" for decryption
For the data processing part, the numbers of cycles and/or throughputs for
encryption implemented by assembly language, C/C++, and Java are listed in Table 3.3.
All the throughputs (unit: Mb/s) are calculated at frequency 200 MHz.
Table 3.3: # of Cycles and Throughputs per Block in Other Implementations
Key Size   Intel 8051    Motorola 68HC08   AES CD (ANSI C)      Brian Gladman (VC++)   Java
           # of cycles   # of cycles       # of cycles   Xput   # of cycles   Xput     # of cycles   Xput
128        4065          8390              950           27.0   363           70.5     23000         1.1
192        4512          10780             1125          22.8   432           59.3     27600         0.93
256        5221          12490             1295          19.8   500           51.2     32300         0.79
* result for encryption only
The MorphoSys implementation result is listed in Table 3.4. Because each time
four blocks are processed in parallel, the actual number of cycles for one block is only
1/4 of the computing cycles. For example, when Key size is 128 bits, the data processing
part for encryption needs 601 / 4 = 150.25 cycles/block.
Table 3.4: # of Cycles and Throughputs per Block in MorphoSys Implementation
Key Size   Encryption           Decryption
           # of cycles   Xput   # of cycles   Xput
128        150.25        170.4  166           154.2
192        175.25        146.1  194.5         131.6
256        200.25        127.8  223           114.8
* in "a/b", "a" for encryption, "b" for decryption
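The throughput figures follow from the cycle counts: one 128-bit block every N cycles at 200 MHz gives 128 × 200 / N Mb/s. A minimal Python sketch of this calculation (the function name throughput_mbps is hypothetical):

```python
def throughput_mbps(cycles_per_block, clock_mhz=200.0, block_bits=128):
    """Throughput in Mb/s for one block every cycles_per_block cycles."""
    return block_bits * clock_mhz / cycles_per_block


# Encryption with a 128-bit Key: 601 cycles for four blocks in parallel.
print(round(throughput_mbps(601 / 4), 1))   # 170.4
# Decryption with a 128-bit Key: 166 cycles per block.
print(round(throughput_mbps(166), 1))       # 154.2
```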
As shown in the above tables, the running time for initialization is much longer than
that for one-block processing, no matter how the AES is implemented. However, the
initialization is only a small fraction of the total running time when the size of the data
to be processed is not very small. Assume the Key size is 128 bits and the data size is
64K Bytes, or 4K blocks; then MorphoSys needs to load data into RC Array about 1000
times (four blocks each time). So the total time for the data processing part is about
601,000 / 664,000 cycles for encryption / decryption, and the time for initialization is
only about 1.5% / 3% of the whole time.
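The initialization share can be reproduced with a short calculation. The Python sketch below uses the 128-bit totals from Table 3.2 and the rounded figure of 1000 loads from the text; the function name init_fraction is hypothetical.

```python
def init_fraction(init_cycles, cycles_per_load, n_loads):
    """Share of the total running time spent on initialization."""
    processing = cycles_per_load * n_loads
    return init_cycles / (init_cycles + processing)


enc = init_fraction(9249, 601, 1000)       # encryption, 128-bit Key
dec = init_fraction(19807, 4 * 166, 1000)  # decryption, 128-bit Key
print(f"{enc:.1%} {dec:.1%}")              # 1.5% 2.9%
```

Using the exact 1024 loads for 64K Bytes instead of 1000 changes the result only slightly.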
On Aug 8, 2001, Amphion Semiconductor Ltd. [20] announced its application-specific
cores for AES applications. The performance of its CS 5210-5280 Family
(standard series) ASIC cores and programmable logic cores is shown in Tables 3.5, 3.6,
and 3.7. The ASIC cores are about 240% to 270% faster than the MorphoSys
implementation, and the programmable logic cores are also about 30% to 60% faster. But
several other issues should be considered when we compare their performance. First,
encryption and decryption need different Amphion cores; second, the initialization time
of the Amphion cores is unknown (though this is usually not important); third,
MorphoSys is not just an ASIC or FPGA, and is capable of running many other
applications efficiently with the same architecture.
Table 3.5: AES by Amphion ASIC Cores using TSMC 0.18µm Technology
Key Size   Logic Gates   Encryption                                     Decryption
                         Timing Constraints (MHz)   Throughput (Mb/s)   Timing Constraints (MHz)   Throughput (Mb/s)
128        18.2K         200                        581                 200                        581
192        18.2K         200                        492                 200                        492
256        18.2K         200                        426                 200                        426
Table 3.6: AES by Amphion Programmable Logic Cores using Altera APEX20KE-1
Key Size   Logic Used (LE)*   Memory Used (ESB)   Encryption                             Decryption
                                                  Clock Speed (MHz)   Throughput (Mb/s)   Clock Speed (MHz)   Throughput (Mb/s)
128        1452/1560          8                   77.8                226                 74.1                215
192        1452/1560          8                   77.8                191                 74.1                182
256        1452/1560          8                   77.8                166                 74.1                158
* encryption/decryption
Table 3.7: AES by Amphion Programmable Logic Cores using Xilinx VirtexE-8
Key Size   Logic Used (LUT)*   Memory Used (BRAM)   Encryption                             Decryption
                                                    Clock Speed (MHz)   Throughput (Mb/s)   Clock Speed (MHz)   Throughput (Mb/s)
128        1008/1092           4                    92.3                268                 86.7                254
192        1008/1092           8                    92.3                227                 86.7                213
256        1008/1092           8                    92.3                196                 86.7                184
* encryption/decryption
Figure 3.16 compares the data processing throughputs of the C/C++, MorphoSys,
Amphion ASIC core, and Amphion FPGA core implementations for encryption at Key
size = 128 bits. The throughput of the MorphoSys implementation is close to that of the
Amphion Altera core implementation.
[Bar chart, throughput in Mb/s: ANSI C 27, C++ 70.5, MorphoSys 170.4, ASIC Core 581, Altera Core 226, Xilinx Core 268]
Figure 3.16: Throughputs of Different Implementations
3.5 Conclusions
The performance of the AES implementation in MorphoSys is satisfactory. The
throughput is more than 100 Mb/s, which is usually adequate for applications on mobile
phones and PDAs. If the throughput requirement of an application is very stringent and
cannot be met by a single MorphoSys, one can consider a larger-scale parallel computing
system consisting of several identical MorphoSys cores. Since there is no data
dependency among blocks, the “scaling up” is theoretically unlimited and will not
introduce the performance degradation that would otherwise exist if there were
inter-block data communications. Of course, in a real implementation, the MorphoSys
chip usually does not run the AES algorithm alone. It might be uneconomical to increase
the number of MorphoSys cores just for the AES requirement.
Another possible approach to improve the performance is to include some
programmable logic block in MorphoSys, such as PLD/CPLD, to handle logic functions
and bit-level operations. But there might be a tradeoff between the flexibility and the
speed. Actually it is a research topic in the MorphoSys group.
Bibliography
[1] M. H. Lee, H. Singh, G. Lu, N. Bagherzadeh, F. J. Kurdahi, “Design and Implementation of the MorphoSys Reconfigurable Computing Processor,” Journal of VLSI Signal Processing Systems, vol. 24, pp. 164-172, March 2000
[2] H. Singh, M. H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, T. Lang, R. Heaton, and E. M. C. Filho, “MorphoSys: An Integrated Re-configurable Architecture,” NATO Symposium on Concepts and Integration, April 1998
[3] S. Brown and J. Rose, “Architecture of FPGAs and CPLDs: A Tutorial,” IEEE Design and Test of Computers, Vol. 13, No. 2, pp. 42-57, 1996
[4] G. Lu, “Modeling, Implementation and Scalability of the MorphoSys Dynamically Reconfigurable Computing Architecture,” Ph.D. Dissertation, 2000
[5] M. H. Lee, “Design and Implementation of the High-Performance Low-Power MorphoSys,” Ph.D. Dissertation, 2000
[6] A. Abnous, C. Christensen, J. Gray, J. Lenell, A. Naylor and N. Bagherzadeh, “Design and implementation of TinyRISC microprocessor,” Microprocessors and Microsystems, Vol. 16, No. 4, pp. 187-94, 1992
[9] F. Koeune, J.-J. Quisquater, “A timing attack against Rijndael,” Technical Report CG-1999/1, UCL Crypto Group, Louvain-la-Neuve, 1999.
[10] E. Biham, A. Shamir, “Power Analysis of the Key Scheduling of the AES Candidates,” Proceedings of the Second Advanced Encryption Standard (AES) Candidate Conference, 1999.
[11] R. Lidl, H. Niederreiter, Introduction to finite fields and their applications, Cambridge University Press, 1986
[12] P. Barreto, V. Rijmen, Rijndael ANSI C Reference Code, downloadable at http://www.esat.kuleuven.ac.be/~rijmen/rijndael/rijndaelref.zip
wfbi 0, 0, 1, 0, 0 # store column #0 to Bank 1, Set 0, addr 0
wfbi 1, 0, 1, 0, 8 # column #1
wfbi 2, 0, 1, 0, 16
wfbi 3, 0, 1, 0, 24
wfbi 4, 0, 1, 0, 32
wfbi 5, 0, 1, 0, 40
wfbi 6, 0, 1, 0, 48
wfbi 7, 0, 1, 0, 56
# Store data from FB to External Memory
ldi $6, 0x2000 # assume output data begins here
stfb $6, 1, 0, 32 # save the 64 bytes (32 words) data back to main memory, Bank 1, Set 0
.end main
D.3 Contexts for Data Processing
The contexts listed here are for the encryption (applicable to all Key sizes). The
contexts for decryption are similar and not listed here.
Column Contexts
set 0 , 0 BYPASS I I > 7 ; # Load Round Key
set 1 , 0 BYPASS I I > 7 ;
set 2 , 0 BYPASS I I > 7 ;
set 3 , 0 BYPASS I I > 7 ;
set 4 , 0 BYPASS I I > 7 ;
set 5 , 0 BYPASS I I > 7 ;
set 6 , 0 BYPASS I I > 7 ;
set 7 , 0 BYPASS I I > 7 ;
set 0 , 1 BYPASS I I > 0 ; # Load original data
set 1 , 1 BYPASS I I > 0 ;
set 2 , 1 BYPASS I I > 0 ;
set 3 , 1 BYPASS I I > 0 ;
set 4 , 1 BYPASS I I > 0 ;
set 5 , 1 BYPASS I I > 0 ;
set 6 , 1 BYPASS I I > 0 ;
set 7 , 1 BYPASS I I > 0 ;
set 0 , 2 LDIM!0x0100 def def > 1 ;
set 1 , 2 LDIM!0x0100 def def > 1 ;
set 2 , 2 LDIM!0x0100 def def > 1 ;
set 3 , 2 LDIM!0x0100 def def > 1 ;
set 4 , 2 LDIM!0x0100 def def > 1 ;
set 5 , 2 LDIM!0x0100 def def > 1 ;
set 6 , 2 LDIM!0x0100 def def > 1 ;
set 7 , 2 LDIM!0x0100 def def > 1 ;
set 0 , 3 ADD r0 r1 > 0 ;
set 1 , 3 ADD r0 r1 > 0 ;
set 2 , 3 ADD r0 r1 > 0 ;
set 3 , 3 ADD r0 r1 > 0 ;
set 4 , 3 ADD r0 r1 > 0 ;
set 5 , 3 ADD r0 r1 > 0 ;
set 6 , 3 ADD r0 r1 > 0 ;
set 7 , 3 ADD r0 r1 > 0 ;
set 0 , 4 LDMM r0 def > 0 ;
set 1 , 4 LDMM r0 def > 0 ;
set 2 , 4 LDMM r0 def > 0 ;
set 3 , 4 LDMM r0 def > 0 ;
set 4 , 4 LDMM r0 def > 0 ;
set 5 , 4 LDMM r0 def > 0 ;
set 6 , 4 LDMM r0 def > 0 ;
set 7 , 4 LDMM r0 def > 0 ;
set 0 , 5 BYPASS L def > 3 ;
set 1 , 5 BYPASS L def > 3 ;
set 2 , 5 BYPASS R def > 3 ; # also Final Step 4
set 3 , 5 BYPASS R def > 3 ;
set 4 , 5 BYPASS L def > 3 ;
set 5 , 5 BYPASS L def > 3 ;
set 6 , 5 BYPASS R def > 3 ;
set 7 , 5 BYPASS R def > 3 ;
set 0 , 6 BYPASS L def > 2 ;
set 1 , 6 BYPASS L def > 2 ; # also Final Step 3
set 2 , 6 BYPASS R def > 2 ;
set 3 , 6 BYPASS R def > 2 ;
set 4 , 6 BYPASS L def > 2 ;
set 5 , 6 BYPASS L def > 2 ;
set 6 , 6 BYPASS R def > 2 ;
set 7 , 6 BYPASS R def > 2 ;
set 0 , 7 XOR r0 r4 > 4 ;
set 1 , 7 XOR r0 r4 > 4 ;
set 2 , 7 XOR r0 r4 > 4 ;
set 3 , 7 XOR r0 r4 > 4 ;
set 4 , 7 XOR r0 r4 > 4 ;
set 5 , 7 XOR r0 r4 > 4 ;
set 6 , 7 XOR r0 r4 > 4 ;
set 7 , 7 XOR r0 r4 > 4 ;
set 0 , 8 XOR r2 r4 > 4 ;
set 1 , 8 XOR r2 r4 > 4 ;
set 2 , 8 XOR r2 r4 > 4 ;
set 3 , 8 XOR r2 r4 > 4 ;
set 4 , 8 XOR r2 r4 > 4 ;
set 5 , 8 XOR r2 r4 > 4 ;
set 6 , 8 XOR r2 r4 > 4 ;
set 7 , 8 XOR r2 r4 > 4 ;
set 0 , 9 XOR r3 r4 > 4 ;
set 1 , 9 XOR r3 r4 > 4 ;
set 2 , 9 XOR r3 r4 > 4 ;
set 3 , 9 XOR r3 r4 > 4 ;
set 4 , 9 XOR r3 r4 > 4 ;
set 5 , 9 XOR r3 r4 > 4 ;
set 6 , 9 XOR r3 r4 > 4 ;
set 7 , 9 XOR r3 r4 > 4 ;
set 0 , 10 LDMM r5 def > 5 ;