90339847-aes-doc


CHAPTER 1

INTRODUCTION

1.1 EVOLUTION OF ENCRYPTION:

Encryption is an ancient art. Julius Caesar protected his written messages with a
simple code: he shifted each letter three positions along the alphabet. (In his honor,
this substitution method is still called the Caesar Cipher.) Later, the 9th-century Arab
scholar Al-Kindi produced a pioneering text entitled A Manuscript on Deciphering
Cryptographic Messages.

Today, with the staggering amount of sensitive information that we digitally store

and transmit, encryption plays a vital role in protecting private information. The

elementary Caesar Cipher may have served its purpose in a relatively illiterate age, but

the intricate security matrix in which we live requires more advanced tools.

In 1975, IBM submitted a proposal to develop a secure standard that businesses such as
banks could use to communicate electronically. The standard, called the Data Encryption
Standard (DES), uses a 56-bit key, which was considered weak even then. Despite being
broken on numerous occasions, it is still widely used today.

In 1976, Whitfield Diffie and Martin Hellman published their asymmetric key system.
An asymmetric key system is different because it allows two users to communicate
securely without having access to a shared secret key. For example, the public key acts
like a key that can only lock a lock, and the private key is the only key that can unlock
it. The two keys are related mathematically, but knowing the public key should not make
it feasible to recover the private key. This led directly to modern encryption methods.

In 1998 the Electronic Frontier Foundation (EFF), founded in 1990, demonstrated that
the DES standard from 1975 was insecure by cracking its 56-bit encryption. The
Foundation then raised awareness about the government's involvement in limiting
standards to DES levels.

In 1991, Phil Zimmermann published the Pretty Good Privacy (PGP) program, which led to
the first case of widespread government regulation of encryption methods. The program
was initially intended to let people use secure BBS systems and store files. The
program's source code was openly distributed and no charge was made for its use.
Zimmermann became the target of a criminal investigation when the program began to be
distributed internationally. At the time, exporting systems that used more than 40 bits
of encryption was restricted, and PGP used keys of around 128 bits. The investigation
was eventually dropped following the public response, but the regulations remained in
place.

http://bizsecurity.about.com/od/glossary/g/encryption_definition.htm


In 1996, after the investigation into Zimmermann's PGP program was completed, the
government relaxed its laws regarding domestic encryption ciphers. However,
international export laws still apply.

In 2001, the Advanced Encryption Standard (AES) replaced DES as the encryption
standard, using key lengths of 128 to 256 bits. However, most DES systems have not
been upgraded.

1.2 ADVANCED ENCRYPTION STANDARD-RIJNDAEL CIPHER

AES stands for Advanced Encryption Standard. AES is a symmetric key encryption
algorithm that replaces the commonly used Data Encryption Standard (DES). AES
provides strong encryption and was selected by NIST as a Federal Information
Processing Standard in November 2001 (FIPS-197). The AES algorithm supports three key
sizes: 128, 192, or 256 bits. Each key size causes the algorithm to behave slightly
differently, so a larger key not only offers more bits with which to scramble the data
but also increases the complexity of the cipher algorithm, because longer keys use more
rounds. AES was developed by two Belgian cryptologists, Vincent Rijmen and Joan Daemen.

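BLOCK DIAGRAM:

[Figure: the AES block takes a 128-bit plain text or cipher text input, a secret key of 128, 192, or 256 bits, and a clock input (Clk2); key schedule generation supplies 128-bit round keys to the encryption/decryption unit, which produces the 128-bit cipher text or plain text output]

FIGURE 1.1 BLOCK DIAGRAM OF AES ALGORITHM [1]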

AES is an algorithm for performing encryption (and the reverse, decryption): a
series of well-defined steps that can be followed as a procedure. The original
information is known as plain text, and the encrypted form as cipher text.

The AES algorithm is a symmetric block cipher that can encrypt (encipher) and
decrypt (decipher) information. Encryption converts data to an unintelligible form called
cipher text; decrypting the cipher text converts the data back to its original form, called
plain text. AES can be used to protect electronic data.


The cipher text message contains all the information of the plaintext message,

but is not in a format readable by a human or computer without the proper mechanism

to decrypt it.

The encrypting procedure is varied depending on the key which changes the

detailed operation of the algorithm. Without the key, the cipher cannot be used to

encrypt or decrypt.

1.3 ENCRYPTION:

Encryption is the transformation of plain text into cipher text through a
mathematical process.

[Figure: plain text and a cipher key enter the encryption block, which outputs cipher text]

FIGURE 1.2 BLOCK DIAGRAM OF ENCRYPTION PROCESS

1.4 DECRYPTION:

Decryption is the process of converting cipher text back into plain text.

[Figure: cipher text and a cipher key enter the decryption block, which outputs plain text]

FIGURE 1.3 BLOCK DIAGRAM OF DECRYPTION PROCESS

1.5 APPLICATION OF CRYPTOGRAPHY:

Cryptography has helped ensure secrecy in important communications, such as
those of government covert operations.

It is helpful in wireless security, such as military communication and mobile
telephony, where there is a greater emphasis on the speed of communication
(military leaders and diplomats).

Cryptography has come to be in widespread use by many civilians who do not
have extraordinary needs for secrecy, although it is typically built transparently
into the infrastructure for computing and telecommunications.


CHAPTER 2

THEORY

2.1 BACKGROUND OF THE ALGORITHM:

Many algorithms were originally submitted by researchers from twelve different
nations. Fifteen (15) algorithms were selected from the first set of submittals. After a
study and selection process, five (5) were chosen as finalists: MARS, RC6, RIJNDAEL,
SERPENT and TWOFISH. The conclusion was that the five competitors showed similar
characteristics. On October 2nd 2000, NIST announced that the Rijndael algorithm was
the winner of the contest. The Rijndael algorithm was chosen because it had the best
overall scores in security, performance, efficiency, ease of implementation and
flexibility [NIS00b]. The Rijndael algorithm was developed by Joan Daemen of Proton
World International and Vincent Rijmen of Katholieke Universiteit Leuven.

The Rijndael algorithm is a symmetric block cipher that can process data blocks
of 128 bits using cipher keys with lengths of 128, 192, and 256 bits. The
Rijndael algorithm was also designed to handle additional block sizes and key lengths.
However, the additional features were not adopted in the AES. A hardware
implementation of the Rijndael algorithm can provide either high performance or low
cost for specific applications. Backbone communication channels and heavily loaded
servers cannot afford the loss of processing speed, and the resulting drop in overall
system efficiency, that comes from running cryptography algorithms in software. On the
other hand, a low-cost, small design can be used in smart card applications, which
allows a wide range of equipment to operate securely.

2.1.1 AES OVERVIEW:

1976-2000: The Data Encryption Standard (DES) is considered the standard for
block ciphers by NIST.

1997-2001: With DES becoming outdated, NIST announces a competition to
design a successor.

2001: Rijndael, designed by Joan Daemen and Vincent Rijmen, is selected by
NIST as the AES.


2.1.2 PROPERTIES OF AES:

Based on finite field mathematics, widely analysed and considered secure,

Used for US government top secret data,

Supports 128, 192, 256 bit keys,

Unpatented, Expected to be the standard for 20+ years

2.2 TYPES OF CIPHERS:

There are two classes of encryption algorithm: asymmetric key and
symmetric key. The following subsections describe both classes, and a brief
discussion of algorithms in general is added as well.

2.2.1 ALGORITHM:

An algorithm is a process for completing a task. An encryption algorithm is a
mathematical process (a mathematical formula) for encrypting and decrypting messages.
It typically has two elements: data (for example, the plain text or email message that
you want to encrypt or decrypt) and a key.

2.2.2 SYMMETRIC ENCRYPTION:

Symmetric encryption uses a secret key value to encrypt and decrypt
data. Both the sender and receiver need the same key to encrypt or decrypt. There are
two types of symmetric algorithms: stream algorithms and block algorithms. A stream
algorithm works on one bit or byte at a time, whereas a block algorithm works on
larger blocks of data (typically 64 bits). The drawback of this type of system is that if
the key is discovered, all the messages can be decrypted. A symmetric key is a key that
is used for both encrypting and decrypting a file or a message.

Examples of symmetric encryption are DES/3DES, AES, IDEA, RC6 and Blowfish.


2.2.3 SYMMETRIC KEY OR PRIVATE KEY:

In a symmetric or private key algorithm, the communication ordinarily uses only
one key. User A sends the secret private key Kc to user B before the start of the
communication between them. Both sides use the same private key to encrypt and
decrypt the exchanged information. The Data Encryption Standard (DES) and CAST-128
are examples of symmetric algorithms.

A symmetric algorithm is much faster than an asymmetric key algorithm,
which needs a bigger key and more complex computation. To encrypt a large amount of
data, a symmetric key algorithm is used with one secret key. A public key algorithm is
then used to encrypt and transmit the symmetric key. At the recipient, the symmetric
key is decrypted. After that, all communication is made using the symmetric algorithm.
Private key cryptography schemes are commonly divided into two classes: block ciphers
and stream ciphers.

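[Figure: User A encrypts the plain text with the private key Kc; the cipher text crosses an unsecured channel; User B decrypts it with the same private key Kc to recover the plain text]

Figure 2.1 Private Key Cryptography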

In a public-private key cryptography system (asymmetric cryptography), the private
key is the secret key of the pair. The private key is normally known only to the key
owner. Messages are encrypted using a public key and can be decrypted by the owner of
the corresponding private key. For digital signatures, however, a document is signed
with a private key and authenticated with the corresponding public key. A private key
should not be distributed.

2.2.4 ASYMMETRIC ENCRYPTION:

Asymmetric encryption (an asymmetric cipher) uses separate, not trivially related keys
for encryption and decryption. The decryption key is very hard or even impossible to
derive from the encryption key. The encryption key is public so that anyone can encrypt
a message. However, the decryption key is private, so that only the receiver is able to
decrypt the message. It is common to set up a pair of keys within a network so that
each user has a public key and a private key. The public key is made available to
everyone so that they can send messages, but the private key is only made available to
the person it belongs to.

Examples of asymmetric encryption are RSA, ELGAMAL.

2.2.5 ASYMMETRIC KEY OR PUBLIC KEY:

In an asymmetric key algorithm, there are two keys. One is public and is used to
encrypt the data. The other key is private and is used to decrypt the information. In
communication between A and B, A uses the public key Ke of B to encrypt the message,
in such a way that only B (and not even A) can decrypt this message, using his private
key Kd. The system is also used to sign a message digitally. Rivest-Shamir-Adleman
(RSA) is a widely used asymmetric key algorithm. Elliptic curve cryptography (ECC) is
an alternative to RSA that offers high security with a small key length.

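[Figure: User A encrypts the plain text with the key Ke; the cipher text crosses an unsecure channel; User B decrypts it with the private key Kd to recover the plain text]

Figure 2.2 Public Key Cryptography.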

A public key is the public half of a public-private key cryptography system and is
used in asymmetric cryptography. Public keys enable someone to encrypt messages
intended for the owner of the public key. Public keys are meant for distribution, so
anyone who wants to send an encrypted message to the owner of the public key can do
so, but only the owner of the corresponding private key can decrypt the message.
Public key cryptography is cryptography based on methods involving a public key and a
private key.

2.2.6 CIPHER TEXT:

This is the encrypted message produced by applying the algorithm to the

plaintext message using the secret key.


2.2.7 BLOCK CIPHER:

A block cipher is a type of symmetric-key encryption algorithm that transforms
a fixed-length block of plaintext data into a block of cipher text data of the same
length. This transformation takes place under the action of a user-provided secret key.
Decryption is performed by applying the reverse transformation to the cipher text block
using the same secret key. The fixed length is called the block size. For many block
ciphers the block size is 64 bits, and it has increased to 128, 192 or 256 bits as
processors have become more sophisticated. The figure below illustrates the block
cipher transformation. Ciphers such as DES, Triple-DES and Blowfish are examples of
block ciphers.

[Figure: block cipher encryption maps plaintext and a key to ciphertext; block cipher decryption maps ciphertext and the same key back to plaintext]

FIGURE 2.3 BLOCK CIPHER

Since different plaintext blocks are mapped to different cipher text blocks (to

allow unique decryption), a block cipher effectively provides a permutation of the set of

all possible messages. The permutation effected during any particular encryption is a

secret, since it is a function of the secret key


2.3 THE AES CIPHER:

Block length is limited to 128 bits

The key size can be independently specified as 128, 192 or 256 bits

TABLE 2.1: KEY SIZE, NUMBER OF ROUNDS, AND EXPANDED KEY SIZE

2.4 NOTATION AND CONVENTIONS:

2.4.1. INPUTS AND OUTPUTS:

The input and output for the AES algorithm consists of sequences of 128 bits.

These sequences are referred to as blocks and the numbers of bits they contain are

referred to as their length. The Cipher Key for the AES algorithm is a sequence of

128,192 or 256 bits. Other input, output and Cipher Key lengths are not permitted by

this standard. The bits within such sequences are numbered starting at zero and ending

at one less than the sequence length, which is also termed the block length or key

length. The number "i" attached to a bit is known as its index and will be in one of the
ranges 0 ≤ i < 128, 0 ≤ i < 192 or 0 ≤ i < 256 depending on the block length or key
length specified.

2.4.2. BYTES:

The basic unit of processing in the AES algorithm is a byte, which is a sequence
of eight bits treated as a single entity. The input, output and Cipher Key bit sequences
described in Section 2.4.1 are processed as arrays of bytes that are formed by dividing
these sequences into groups of eight contiguous bits. For an input, output or Cipher Key
denoted by a, the bytes in the resulting array are referenced using one of the two
forms, $a_n$ or a[n], where n will be in a range that depends on the key length. For a
key length of 128 bits, n lies in the range 0 ≤ n < 16. For a key length of 192 bits, n
lies in the range 0 ≤ n < 24. For a key length of 256 bits, n lies in the range
0 ≤ n < 32.

All byte values in the AES algorithm are presented as the concatenation of the
individual bit values (0 or 1) between braces in the order {b7, b6, b5, b4, b3, b2, b1,
b0}. These bytes are interpreted as finite field elements using a polynomial
representation

\[ b_7 x^7 + b_6 x^6 + b_5 x^5 + b_4 x^4 + b_3 x^3 + b_2 x^2 + b_1 x + b_0 = \sum_{i=0}^{7} b_i x^i \]

Key size (words/bytes/bits)        4/16/128    6/24/192    8/32/256
Number of rounds                   10          12          14
Expanded key size (words/bytes)    44/176      52/208      60/240


For example, {01100011} identifies the specific finite field element $x^6 + x^5 + x + 1$.
It is also convenient to denote byte values using hexadecimal notation, with each of two
groups of four bits being denoted by a single hexadecimal character. The hexadecimal
notation scheme is depicted in Table 2.2.

TABLE 2.2 HEXADECIMAL REPRESENTATION OF BIT PATTERNS

[1]

Hence the element {01100011} can be represented as {63}, where the character

denoting the four-bit group containing the higher numbered bits is again to the left.

Some finite field operations involve one additional bit {b8} to the left of an 8-bit byte.

When the b8 bit is present, it appears as {01} immediately preceding the 8-bit byte. For

example, a 9-bit sequence is presented as {01} {1b}.

2.4.3. ARRAYS OF BYTES:

Arrays of bytes are represented in the form $a_0 a_1 a_2 \cdots a_{15}$. The bytes and
the bit ordering within bytes are derived from the 128-bit input sequence,
$input_0\, input_1\, input_2 \cdots input_{126}\, input_{127}$, as

\[ a_0 = \{input_0, input_1, \cdots, input_7\}, \quad a_1 = \{input_8, input_9, \cdots, input_{15}\} \]

with the pattern continuing up to

\[ a_{15} = \{input_{120}, input_{121}, \cdots, input_{127}\}. \]

The pattern can be extended to longer sequences associated with 192 and 256
bit keys. In general,

\[ a_n = \{input_{8n}, input_{8n+1}, \cdots, input_{8n+7}\}. \]

An example of byte designation and numbering within bytes for a given input sequence
is presented in Figure 2.4.

Figure 2.4: Indices for Bytes and Bits

[1]
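The grouping rule for bytes can be sketched in a few lines of Python. This is a minimal illustration, not part of the standard: the function name `bits_to_bytes` and the sample bit list are assumptions made for the example, and the first bit of each group is taken as b7, the most significant bit.

```python
def bits_to_bytes(bits):
    """Group a bit sequence (input_0 first) into bytes a_n = {input_8n, ..., input_8n+7}."""
    if len(bits) % 8 != 0:
        raise ValueError("bit sequence length must be a multiple of 8")
    out = []
    for n in range(len(bits) // 8):
        value = 0
        for bit in bits[8 * n : 8 * n + 8]:
            # input_8n ends up as b7, the most significant bit of byte a_n
            value = (value << 1) | bit
        out.append(value)
    return out


# Hypothetical 128-bit input whose first eight bits are 01100011, i.e. {63}
example_bits = [0, 1, 1, 0, 0, 0, 1, 1] + [0] * 120
example_bytes = bits_to_bytes(example_bits)
print([f"{b:02x}" for b in example_bytes[:2]])
```

A 128-bit sequence yields 16 bytes, which matches the range 0 ≤ n < 16 given for 128-bit inputs.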

2.4.4. THE STATE:

Internally, the AES algorithm's operations are performed on a two-dimensional
array of bytes called the State. The State consists of four rows of bytes. Each row of
the State contains Nb bytes, where Nb is the block length divided by 32. In the
State array, which is denoted by the symbol S, each individual byte has two indices.
The first byte index is the row number r, which lies in the range 0 ≤ r ≤ 3, and the
second byte index is the column number c, which lies in the range 0 ≤ c ≤ Nb−1. Such
indexing allows an individual byte of the State to be referred to as $S_{r,c}$ or
S[r,c]. For the AES, Nb = 4, which means that 0 ≤ c ≤ 3. At the beginning of the
Encryption and Decryption the input, which is the array of bytes symbolized by
$in_0 in_1 \cdots in_{15}$, is copied into the State array. This activity is illustrated
in Figure 2.5. The Encryption or Decryption operations are conducted on the State
array. After manipulation of the State array has completed, its final value is copied to
the output, which is an array of bytes symbolized by $out_0 out_1 \cdots out_{15}$.

[Figure: the input bytes are copied into the State array, and the State array is copied to the output bytes]

FIGURE 2.5: STATE ARRAY INPUT AND OUTPUT

[1]

At the start of the Encryption or Decryption the input array is copied to the State
array with

S[r, c] = in[r + 4c]

where 0 ≤ r ≤ 3 and 0 ≤ c ≤ Nb−1. At the end of the Encryption and Decryption the
State is copied to the output array with

out[r + 4c] = S[r, c]

where 0 ≤ r ≤ 3 and 0 ≤ c ≤ Nb−1.
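The two copy rules can be checked with a short Python sketch. The helper names `bytes_to_state` and `state_to_bytes` and the sample block are assumptions made for illustration; the State is held as a list of four rows.

```python
NB = 4  # number of columns in the State for AES


def bytes_to_state(block):
    """Copy a 16-byte input into the State using S[r][c] = in[r + 4c]."""
    if len(block) != 4 * NB:
        raise ValueError("block must be 16 bytes long")
    return [[block[r + 4 * c] for c in range(NB)] for r in range(4)]


def state_to_bytes(state):
    """Copy the State to the output using out[r + 4c] = S[r][c]."""
    out = [0] * (4 * NB)
    for r in range(4):
        for c in range(NB):
            out[r + 4 * c] = state[r][c]
    return bytes(out)


block = bytes(range(16))       # hypothetical input in0..in15 = 0..15
state = bytes_to_state(block)
print(state[0])                # first row holds in0, in4, in8, in12
print(state_to_bytes(state) == block)
```

The input fills the State column by column, so each row of the State holds every fourth input byte.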

2.4.5. THE STATE AS AN ARRAY OF COLUMNS:

The four bytes in each column of the State form 32-bit words, where the row
number "r" provides an index for the four bytes within each word. Therefore, the State
can be interpreted as a one-dimensional array of 32-bit words, which is symbolized by
$w_0 \ldots w_3$. The column number c provides an index into this linear State array.
Considering the State depicted in Figure 2.5, the State can be considered as an array of
four words where

\[ w_0 = S_{0,0}\, S_{1,0}\, S_{2,0}\, S_{3,0}, \]
\[ w_1 = S_{0,1}\, S_{1,1}\, S_{2,1}\, S_{3,1}, \]
\[ w_2 = S_{0,2}\, S_{1,2}\, S_{2,2}\, S_{3,2}, \]
\[ w_3 = S_{0,3}\, S_{1,3}\, S_{2,3}\, S_{3,3}. \]


2.4.6 MATHEMATICAL BACKGROUND:

Every byte in the AES algorithm is interpreted as a finite field element using the
notation introduced in Section 2.4.2. All finite field elements can be added and
multiplied. However, these operations differ from those used for ordinary numbers, so
they are examined in the following subsections.

2.4.6(A) ADDITION:

The addition of two elements in a finite field is achieved by "adding" the
coefficients for the corresponding powers in the polynomials for the two elements. The
addition is performed through use of the XOR operation, which is denoted by the
operator symbol ⊕. Such addition is performed modulo-2. In modulo-2 addition

1 ⊕ 1 = 0,

1 ⊕ 0 = 1,

0 ⊕ 1 = 1

and

0 ⊕ 0 = 0.

Consequently, subtraction of polynomials is identical to addition of polynomials.
Alternatively, addition of finite field elements can be described as the modulo-2 addition
of corresponding bits in the byte. For two bytes $\{a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0\}$
and $\{b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0\}$, the sum is
$\{c_7 c_6 c_5 c_4 c_3 c_2 c_1 c_0\}$, where each $c_i = a_i \oplus b_i$ and i indexes
the corresponding bits. For example, the following expressions are equivalent to
one another.

\[ (x^6 + x^4 + x^2 + x + 1) + (x^7 + x + 1) = x^7 + x^6 + x^4 + x^2 \]  (Polynomial notation)

{01010111} ⊕ {10000011} = {11010100} (Binary notation)

{57} ⊕ {83} = {d4} (Hexadecimal notation)
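Because finite field addition is a bitwise XOR, the example can be reproduced directly in Python. The names `gf_add` and `poly_str` are illustrative helpers, not names from the standard.

```python
def gf_add(a, b):
    """Add two GF(2^8) elements: coefficient-wise addition modulo 2, i.e. XOR."""
    return a ^ b


def poly_str(byte):
    """Render a byte as the polynomial it represents, highest power first."""
    terms = []
    for i in range(7, -1, -1):
        if (byte >> i) & 1:
            terms.append("1" if i == 0 else "x" if i == 1 else f"x^{i}")
    return " + ".join(terms) if terms else "0"


total = gf_add(0x57, 0x83)
print(f"{{{total:02x}}} = {poly_str(total)}")
```

Adding an element to itself gives zero, which is why subtraction and addition coincide in this field.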

2.4.6(B) MULTIPLICATION:

In the polynomial representation, multiplication in the Galois Field GF(2^8) (denoted
by •) corresponds to the multiplication of polynomials modulo an irreducible polynomial
of degree 8. A polynomial is irreducible if its only divisors are one and itself. For the
AES algorithm, this irreducible polynomial is given by the equation below

\[ m(x) = x^8 + x^4 + x^3 + x + 1 \]


For example, {57} • {83} = {c1} because

\[ (x^6 + x^4 + x^2 + x + 1)(x^7 + x + 1) = x^{13} + x^{11} + x^9 + x^8 + x^7 + x^7 + x^5 + x^3 + x^2 + x + x^6 + x^4 + x^2 + x + 1 \]
\[ = x^{13} + x^{11} + x^9 + x^8 + x^6 + x^5 + x^4 + x^3 + 1 \]

and

\[ (x^{13} + x^{11} + x^9 + x^8 + x^6 + x^5 + x^4 + x^3 + 1) \bmod (x^8 + x^4 + x^3 + x + 1) = x^7 + x^6 + 1. \]
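The two steps of this example, polynomial multiplication followed by reduction modulo m(x), can be sketched in Python with integers standing in for binary polynomials (bit i is the coefficient of x^i). The function names are assumptions chosen for the illustration.

```python
AES_MODULUS = 0x11B  # m(x) = x^8 + x^4 + x^3 + x + 1


def poly_mul(a, b):
    """Multiply two binary polynomials without reduction (carry-less product)."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # add (XOR) the current shifted copy of a
        a <<= 1
        b >>= 1
    return result


def poly_mod(value, modulus=AES_MODULUS):
    """Remainder of a binary polynomial after division by the modulus."""
    mod_degree = modulus.bit_length() - 1
    while value.bit_length() - 1 >= mod_degree:
        shift = value.bit_length() - 1 - mod_degree
        value ^= modulus << shift   # cancel the leading term
    return value


def gf_mul(a, b):
    """Multiply two GF(2^8) elements as defined for AES."""
    return poly_mod(poly_mul(a, b))


product = poly_mul(0x57, 0x83)
print(f"unreduced: {product:#06x}, reduced: {gf_mul(0x57, 0x83):02x}")
```

The unreduced product has degree 13, and the reduction brings it back below degree 8 so that it fits in one byte.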

The modular reduction by m(x) ensures that the result will be a binary polynomial
of degree less than 8, which can be represented by a byte. Unlike addition, there is no
simple operation at the byte level that corresponds to this multiplication. The
multiplication defined above is associative and the element {01} is the multiplicative
identity. For any non-zero binary polynomial b(x) of degree less than 8, the
multiplicative inverse of b(x), denoted $b^{-1}(x)$, can be found. The inverse is found
through use of the extended Euclidean algorithm to compute polynomials a(x) and c(x)
such that

\[ b(x)\,a(x) + m(x)\,c(x) = 1. \]

Hence, a(x) • b(x) mod m(x) = 1, which means

\[ b^{-1}(x) = a(x) \bmod m(x). \]

Moreover, for any a(x), b(x) and c(x) in the field, it holds that

\[ a(x) \bullet (b(x) + c(x)) = a(x) \bullet b(x) + a(x) \bullet c(x). \]

It follows that the set of 256 possible byte values, with XOR used as addition and
multiplication defined as above, has the structure of the finite field GF(2^8).
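A compact Python sketch of the extended Euclidean algorithm over binary polynomials shows how the inverse can be computed. It is one possible arrangement of the algorithm under the assumption that integers encode polynomials bit by bit; the helper names are hypothetical.

```python
AES_MODULUS = 0x11B  # m(x) = x^8 + x^4 + x^3 + x + 1


def poly_mul(a, b):
    """Carry-less product of two binary polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_divmod(num, den):
    """Quotient and remainder of binary polynomial division (den must be non-zero)."""
    quotient = 0
    den_degree = den.bit_length() - 1
    while num.bit_length() - 1 >= den_degree:
        shift = num.bit_length() - 1 - den_degree
        quotient ^= 1 << shift
        num ^= den << shift
    return quotient, num


def gf_mul(a, b):
    return poly_divmod(poly_mul(a, b), AES_MODULUS)[1]


def gf_inverse(b):
    """Find a(x) with b(x)a(x) + m(x)c(x) = 1 and return a(x) mod m(x)."""
    if b == 0:
        raise ZeroDivisionError("{00} has no multiplicative inverse")
    # Invariant: r0 = s0 * b (mod m) and r1 = s1 * b (mod m)
    r0, r1 = AES_MODULUS, b
    s0, s1 = 0, 1
    while r1 != 0:
        q, r = poly_divmod(r0, r1)
        r0, r1 = r1, r
        s0, s1 = s1, s0 ^ poly_mul(q, s1)
    # m(x) is irreducible, so the final r0 (the gcd) equals 1
    return poly_divmod(s0, AES_MODULUS)[1]


inv = gf_inverse(0x53)
print(f"{inv:02x}", gf_mul(0x53, inv))
```

Only the coefficient a(x) is tracked, because c(x) is not needed to obtain the inverse.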

2.4.6(C) MULTIPLICATION BY X:

Multiplying the binary polynomial b(x) defined in Section 2.4.2 by the polynomial x
results in

\[ b_7 x^8 + b_6 x^7 + b_5 x^6 + b_4 x^5 + b_3 x^4 + b_2 x^3 + b_1 x^2 + b_0 x \]

The result x • b(x) is obtained by reducing the above result modulo m(x). If b7
equals zero the result is already in reduced form. If b7 equals one the reduction is
accomplished by subtracting the polynomial m(x). It follows that multiplication by x,
which is represented by {00000010} or {02}, can be implemented at the byte level as a
left shift and a subsequent conditional bitwise XOR with {1b}. This operation on bytes is
denoted by xtime(). Multiplication by higher powers of x can be implemented by
repeated application of xtime(). Through the addition of intermediate results,
multiplication by any constant can be implemented.


For example, {57} • {13} = {fe} because

{57} • {02} = xtime({57}) = {ae}

{57} • {04} = xtime({ae}) = {47}

{57} • {08} = xtime({47}) = {8e}

{57} • {10} = xtime({8e}) = {07},

Thus,

{57} • {13} = {57} • ({01} ⊕ {02} ⊕ {10})

= {57} ⊕ {ae} ⊕ {07}

= {fe}.
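The xtime() operation and the shift-and-add multiplication built on it can be sketched as follows. This is an illustrative Python version; `gf_mul` is a name chosen here for the general multiplication routine.

```python
def xtime(b):
    """Multiply a GF(2^8) element by x ({02}): left shift, then conditional XOR with {1b}."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted


def gf_mul(a, b):
    """Multiply a by b by adding the xtime() multiples of a selected by the bits of b."""
    result = 0
    while b:
        if b & 1:
            result ^= a   # this power of x is present in b
        a = xtime(a)      # advance to the next power of x
        b >>= 1
    return result


value = 0x57
for power in (0x02, 0x04, 0x08, 0x10):
    value = xtime(value)
    print(f"{{57}} . {{{power:02x}}} = {{{value:02x}}}")
print(f"{{57}} . {{13}} = {{{gf_mul(0x57, 0x13):02x}}}")
```

Since {13} = {01} ⊕ {02} ⊕ {10}, the loop adds the original value and its first and fourth xtime() multiples.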

2.5 POLYNOMIALS WITH COEFFICIENTS IN GF(2^8):

Four-term polynomials can be defined with coefficients that are finite field
elements, as in the following equation

\[ a(x) = a_3 x^3 + a_2 x^2 + a_1 x + a_0 \]

which will be denoted as a word in the form $[a_0, a_1, a_2, a_3]$. Note that the
polynomials in this section behave somewhat differently than the polynomials used in
the definition of finite field elements, even though both types of polynomials use the
same indeterminate, x. The coefficients in this section are themselves finite field
elements, i.e., bytes, instead of bits; also, the multiplication of four-term polynomials
uses a different reduction polynomial, defined below. To illustrate the addition and
multiplication operations, let

\[ b(x) = b_3 x^3 + b_2 x^2 + b_1 x + b_0 \]

define a second four-term polynomial. Addition is performed by adding the finite field
coefficients of like powers of x. This addition corresponds to an XOR operation between
the corresponding bytes in each of the words, in other words, the XOR of the complete
word values. Thus, using the definitions of a(x) and b(x) above,

\[ a(x) + b(x) = (a_3 \oplus b_3) x^3 + (a_2 \oplus b_2) x^2 + (a_1 \oplus b_1) x + (a_0 \oplus b_0) \]

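Multiplication is achieved in two steps. In the first step, the polynomial product
c(x) = a(x) • b(x) is algebraically expanded, and like powers are collected to give

\[ c(x) = c_6 x^6 + c_5 x^5 + c_4 x^4 + c_3 x^3 + c_2 x^2 + c_1 x + c_0 \]

where

\[ c_0 = a_0 \bullet b_0 \]
\[ c_1 = a_1 \bullet b_0 \oplus a_0 \bullet b_1 \]
\[ c_2 = a_2 \bullet b_0 \oplus a_1 \bullet b_1 \oplus a_0 \bullet b_2 \]
\[ c_3 = a_3 \bullet b_0 \oplus a_2 \bullet b_1 \oplus a_1 \bullet b_2 \oplus a_0 \bullet b_3 \]
\[ c_4 = a_3 \bullet b_1 \oplus a_2 \bullet b_2 \oplus a_1 \bullet b_3 \]
\[ c_5 = a_3 \bullet b_2 \oplus a_2 \bullet b_3 \]
\[ c_6 = a_3 \bullet b_3 \]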

The result, c(x), does not represent a four-byte word. Therefore, the second step
of the multiplication is to reduce c(x) modulo a polynomial of degree 4, so that the
result is a polynomial of degree less than 4. For the AES algorithm, this is
accomplished with the polynomial $x^4 + 1$, so that

\[ x^i \bmod (x^4 + 1) = x^{i \bmod 4}. \]

The modular product of a(x) and b(x), denoted by a(x) ⊗ b(x), is given by the
four-term polynomial d(x), defined as follows

\[ d(x) = d_3 x^3 + d_2 x^2 + d_1 x + d_0 \]

with

\[ d_0 = (a_0 \bullet b_0) \oplus (a_3 \bullet b_1) \oplus (a_2 \bullet b_2) \oplus (a_1 \bullet b_3) \]
\[ d_1 = (a_1 \bullet b_0) \oplus (a_0 \bullet b_1) \oplus (a_3 \bullet b_2) \oplus (a_2 \bullet b_3) \]
\[ d_2 = (a_2 \bullet b_0) \oplus (a_1 \bullet b_1) \oplus (a_0 \bullet b_2) \oplus (a_3 \bullet b_3) \]
\[ d_3 = (a_3 \bullet b_0) \oplus (a_2 \bullet b_1) \oplus (a_1 \bullet b_2) \oplus (a_0 \bullet b_3) \]

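When a(x) is a fixed polynomial, the operation defined by the equations for d0 to d3
can be written in matrix form as follows.

\[
\begin{bmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \end{bmatrix} =
\begin{bmatrix}
a_0 & a_3 & a_2 & a_1 \\
a_1 & a_0 & a_3 & a_2 \\
a_2 & a_1 & a_0 & a_3 \\
a_3 & a_2 & a_1 & a_0
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}
\]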

Because $x^4 + 1$ is not an irreducible polynomial over GF(2^8), multiplication by a
fixed four-term polynomial is not necessarily invertible. However, the AES algorithm
specifies a fixed four-term polynomial that does have an inverse, given by

\[ a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\} \]
\[ a^{-1}(x) = \{0b\}x^3 + \{0d\}x^2 + \{09\}x + \{0e\} \]

Another polynomial used in the AES algorithm has $a_0 = a_1 = a_2 = \{00\}$ and
$a_3 = \{01\}$, which is the polynomial $x^3$. Inspection of the matrix equation above
will show that its effect is to form the output word by rotating bytes in the input word.
This means that $[b_0, b_1, b_2, b_3]$ is transformed into $[b_1, b_2, b_3, b_0]$.
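The modular product of four-term polynomials can be sketched in Python by accumulating each coefficient product at index (i + j) mod 4, which is what reduction modulo x^4 + 1 amounts to. The name `word_mul` and the list layout [a0, a1, a2, a3] are assumptions made for the example.

```python
def xtime(b):
    """Multiply a GF(2^8) element by x ({02})."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted


def gf_mul(a, b):
    """GF(2^8) multiplication by repeated xtime()."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def word_mul(a, b):
    """Product of two four-term polynomials modulo x^4 + 1; words are [a0, a1, a2, a3]."""
    d = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            # x^(i+j) reduces to x^((i+j) mod 4)
            d[(i + j) % 4] ^= gf_mul(a[i], b[j])
    return d


a = [0x02, 0x01, 0x01, 0x03]       # a(x) = {03}x^3 + {01}x^2 + {01}x + {02}
a_inv = [0x0E, 0x09, 0x0D, 0x0B]   # a^-1(x) = {0b}x^3 + {0d}x^2 + {09}x + {0e}
print(word_mul(a, a_inv))           # the product should be the constant {01}
print(word_mul([0, 0, 0, 1], [0xB0, 0xB1, 0xB2, 0xB3]))  # multiplying by x^3 rotates the word
```

The second call uses arbitrary sample bytes to show the rotation of [b0, b1, b2, b3] into [b1, b2, b3, b0].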


2.6. ENCRYPTION PROCESS:

This block diagram is generic for the AES specification. It consists of a number of
different transformations applied consecutively to the data block bits, in a fixed
number of iterations called rounds. The number of rounds depends on the length of the
key used for the encryption process.
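The dependence of the round count on the key length can be made concrete with a small Python sketch. It assumes the usual relations Nr = Nk + 6 and an expanded key of Nb(Nr + 1) words, which reproduce the values in Table 2.1; the function name `aes_parameters` is hypothetical.

```python
NB = 4  # State columns (32-bit words per block) for AES


def aes_parameters(key_bits):
    """Return (Nk, Nr, expanded key words, expanded key bytes) for an AES key length."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES keys are 128, 192 or 256 bits long")
    nk = key_bits // 32                 # key length in 32-bit words
    nr = nk + 6                         # assumed relation between key words and rounds
    expanded_words = NB * (nr + 1)      # one round key per round plus the initial one
    return nk, nr, expanded_words, 4 * expanded_words


for bits in (128, 192, 256):
    print(bits, aes_parameters(bits))
```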

A 128-bit input or output block of AES is mapped to an AES State by putting
the first byte of the block in the upper left corner of the matrix and filling in the
remaining bytes column by column. A round consists of a fixed sequence of
transformations. Except for the first round and the last round, the rounds are identical
and consist of four transformations. The four transformations are invertible, hence the
round itself is invertible.

[Figure: the plain text passes through a 1st round (AddRoundKey), then Nr-1 repeated rounds (SubBytes, ShiftRows, MixColumns, AddRoundKey), then a last round (SubBytes, ShiftRows, AddRoundKey); each round takes its own round key]

FIGURE 2.6: BLOCK DIAGRAM OF ENCRYPTION

[Figure: a data block and a key pass through SubBytes, ShiftRows, MixColumns and AddRoundKey to produce the next data block]

FIGURE 2.7: STRUCTURE OF ONE ROUND

2.6.1. BYTES SUBSTITUTION TRANSFORMATION:

The bytes substitution transformation SubBytes() is a non-linear substitution
of bytes that operates independently on each byte of the State using a substitution table
(S-box) presented in Figure 2.10. This S-box, which is invertible, is constructed by
composing two transformations

1. Take the multiplicative inverse in the finite field GF(2^8), described in Section
2.4.6(B). The element {00} is mapped to itself.

2. Apply the following affine transformation (over GF(2))

\[ b_i' = b_i \oplus b_{(i+4) \bmod 8} \oplus b_{(i+5) \bmod 8} \oplus b_{(i+6) \bmod 8} \oplus b_{(i+7) \bmod 8} \oplus c_i \]

for 0 ≤ i < 8, where $b_i$ is the ith bit of the byte, and $c_i$ is the ith bit of a byte
c with the value {63} or {01100011}. Here and elsewhere, a prime on a variable (e.g., b′)
indicates that the variable is to be updated with the value on the right. In matrix form,
the affine transformation element of the S-box can be expressed as


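\[
\begin{bmatrix} b_0' \\ b_1' \\ b_2' \\ b_3' \\ b_4' \\ b_5' \\ b_6' \\ b_7' \end{bmatrix} =
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{bmatrix} +
\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}
\]

FIGURE 2.8 MATRIX NOTATION OF S-BOX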

FIGURE 2.9. APPLICATION OF S-BOX TO THE EACH BYTE OF THE STATE [1]

The S-box used in the SubBytes transformation is presented in hexadecimal
form in Figure 2.10. For example, if $S_{1,1}$ = {53}, then the substitution value would
be determined by the intersection of the row with index '5' and the column with index
'3' in Figure 2.10. This would result in $S'_{1,1}$ having a value of {ed}.
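The two-step construction of the S-box can be sketched in Python. As a simplification that is named here plainly, the inverse is computed as the 254th power of the element rather than with the extended Euclidean algorithm; the function names are assumptions for this illustration.

```python
def xtime(b):
    """Multiply a GF(2^8) element by x ({02})."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted


def gf_mul(a, b):
    """GF(2^8) multiplication by repeated xtime()."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def gf_inverse(b):
    """Step 1: multiplicative inverse, with {00} mapped to itself (computed as b^254)."""
    if b == 0:
        return 0
    result = 1
    for _ in range(254):
        result = gf_mul(result, b)
    return result


def affine(b, c=0x63):
    """Step 2: b'_i = b_i ^ b_(i+4) ^ b_(i+5) ^ b_(i+6) ^ b_(i+7) ^ c_i, indices mod 8."""
    out = 0
    for i in range(8):
        bit = (b >> i) ^ (c >> i)
        for offset in (4, 5, 6, 7):
            bit ^= b >> ((i + offset) % 8)
        out |= (bit & 1) << i
    return out


def sub_byte(b):
    """Substitution value for one byte: inverse followed by the affine transformation."""
    return affine(gf_inverse(b))


print(f"{sub_byte(0x53):02x}")
```

A table-driven implementation would precompute all 256 values once; computing them on the fly is only meant to mirror the construction described in the text.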


FIGURE 2.10. S-BOX VALUES FOR ALL 256 COMBINATIONS IN HEXADECIMAL FORMAT

[1]

2.6.2. SHIFT ROWS TRANSFORMATION:

In the Shift Rows transformation ShiftRows(), the bytes in the last three rows of
the State are cyclically shifted over different numbers of bytes (offsets). The first row,
r = 0, is not shifted. Specifically, the ShiftRows() transformation proceeds as follows

\[ S'_{r,c} = S_{r,\,(c + shift(r, Nb)) \bmod Nb} \quad \text{for } 0 < r < 4 \text{ and } 0 \le c < Nb, \]

where the shift value shift(r, Nb) depends on the row number, r, as follows (Nb = 4)

shift(1,4) = 1; shift(2,4) = 2; shift(3,4) = 3.

This has the effect of moving bytes to "lower" positions in the row (i.e., lower
values of c in a given row), while the "lowest" bytes wrap around into the "top" of the
row (i.e., higher values of c in a given row). Figure 2.11 illustrates the ShiftRows()
transformation.
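A short Python sketch of the row shift, assuming the State is stored as a list of four rows and using shift(r, 4) = r. The function name and the sample State are illustrative only.

```python
def shift_rows(state):
    """Return a new State with row r cyclically shifted left by r positions."""
    nb = len(state[0])
    return [[state[r][(c + r) % nb] for c in range(nb)] for r in range(4)]


# Hypothetical State whose entries encode their own position as 10*r + c
sample = [[10 * r + c for c in range(4)] for r in range(4)]
for row in shift_rows(sample):
    print(row)
```

Row 0 is returned unchanged, and in every other row the bytes that fall off the left end reappear on the right.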


Figure 2.11. Cyclic Shift of the Last Three Rows of the State

[1]

2.6.3. MIXING OF COLUMNS TRANSFORMATION:

This transformation is based on Galois Field multiplication. Each byte of a
column is replaced with another value that is a function of all four bytes in the given
column. The MixColumns() transformation operates on the State column-by-column,
treating each column as a four-term polynomial as described in Section 2.5. The
columns are considered as polynomials over GF(2^8) and multiplied modulo $x^4 + 1$
with a fixed polynomial a(x), given by the following equation.

\[ a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}. \]

As described in Section 2.5, this can be written as a matrix multiplication. Let

\[ S'(x) = a(x) \otimes S(x) \]

\[
\begin{bmatrix} S'_{0,c} \\ S'_{1,c} \\ S'_{2,c} \\ S'_{3,c} \end{bmatrix} =
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} S_{0,c} \\ S_{1,c} \\ S_{2,c} \\ S_{3,c} \end{bmatrix}
\quad \text{for } 0 \le c < Nb.
\]


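As a result of this multiplication, the four bytes in a column are replaced by the
following

\[ S'_{0,c} = (\{02\} \bullet S_{0,c}) \oplus (\{03\} \bullet S_{1,c}) \oplus (\{01\} \bullet S_{2,c}) \oplus (\{01\} \bullet S_{3,c}) \]
\[ S'_{1,c} = (\{01\} \bullet S_{0,c}) \oplus (\{02\} \bullet S_{1,c}) \oplus (\{03\} \bullet S_{2,c}) \oplus (\{01\} \bullet S_{3,c}) \]
\[ S'_{2,c} = (\{01\} \bullet S_{0,c}) \oplus (\{01\} \bullet S_{1,c}) \oplus (\{02\} \bullet S_{2,c}) \oplus (\{03\} \bullet S_{3,c}) \]
\[ S'_{3,c} = (\{03\} \bullet S_{0,c}) \oplus (\{01\} \bullet S_{1,c}) \oplus (\{01\} \bullet S_{2,c}) \oplus (\{02\} \bullet S_{3,c}) \]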

FIGURE 2.12. MIXING OF COLUMNS OF THE STATE

[1]

Understanding Of Calculations For Mix-Columns

The Mix-Columns calculation is worked through in detail below.

The Mix-Columns result is calculated using this formula [1]:

\[
\begin{bmatrix} r_0 \\ r_1 \\ r_2 \\ r_3 \end{bmatrix}
=
\begin{bmatrix}
2 & 3 & 1 & 1 \\
1 & 2 & 3 & 1 \\
1 & 1 & 2 & 3 \\
3 & 1 & 1 & 2
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix}
\]

where r0, r1, r2 and r3 are the results after the transformation, and a0 to a3 are obtained from a column of the State after the data has undergone the S-Box substitution and Shift Rows steps.

Let's take this example, with the input column a0-a3 on the left and the result column r0-r3 on the right:

\[
\begin{bmatrix} d4 \\ bf \\ 5d \\ 30 \end{bmatrix}
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
=
\begin{bmatrix} 04 \\ 66 \\ 81 \\ e5 \end{bmatrix}
\]

In this example, a0-a3 is equal to d4, bf, 5d, 30 and r0-r3 is equal to 04, 66, 81, e5. Note that the calculation still follows the matrix multiplication rule of row x column. As written above, the matrix sizes are

[4 x 1] . [4 x 4] ≠ [4 x 1]

which is not a valid product. To obtain a [4 x 1] result, the formula has to be

[4 x 4] . [4 x 1] = [4 x 1]

Therefore we switch the matrices over:

\[
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} d4 \\ bf \\ 5d \\ 30 \end{bmatrix}
=
\begin{bmatrix} 04 \\ 66 \\ 81 \\ e5 \end{bmatrix}
\]

To calculate the results, multiply the rows of the matrix with the column. First, take the first row of the matrix and multiply its values with the values of a.

To get the r0 value, the formula goes like this:

r0 = {02.d4} + {03.bf} + {01.5d} + {01.30}

We work through the terms one at a time.

1. {02.d4}

First convert d4 to binary. d4 is a byte, so when using the Calculator, change it to Byte under Hex mode.

d4 = 1101 0100

Now d4 is exactly 8 bits. When a value converts to fewer than 8 bits, such as 25 in Hex (converted: 100101), pad the result with 0s at the front until it is 8 characters of 1s and 0s (25 ends up as 0010 0101).

Another thing to remember: there is a rule for the multiplication of the values, as written in the book Cryptography and Network Security [2], that multiplication of a value by x (i.e., by 02) can be implemented as a 1-bit left shift followed by a conditional bitwise XOR with (0001 1011) if the leftmost bit of the original value (before the shift) is 1. Now apply the rule in the calculation.

{d4}.{02} = 1101 0100 << 1 (<< is left shift, 1 is the number of bits shifted, padding with 0s)

= 1010 1000 XOR 0001 1011 (because the leftmost bit is 1 before the shift)

= 1011 0011 (ans)

Calculation:

1010 1000

0001 1011 (XOR)

1011 0011
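The shift-and-conditional-XOR rule can be checked with a short Python sketch. The function name xtime is an assumption borrowed from common AES write-ups and is not used elsewhere in this document.

```python
def xtime(b):
    """Multiply a byte by {02} in GF(2^8)."""
    shifted = (b << 1) & 0xFF   # 1-bit left shift, keeping only 8 bits
    if b & 0x80:                # leftmost bit was 1 before the shift
        shifted ^= 0x1B         # conditional XOR with 0001 1011
    return shifted


result = xtime(0xD4)
print(f"{result:08b}")  # 10110011, matching the hand calculation
```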

Now do the same for next set of values, {03.bf}

2. {03.bf}

Similarly, convert bf into binary:

bf = 1011 1111

In this case we multiply bf by 03, so split 03 up in its binary form.

03 = 11

= 10 XOR 01

We are now able to calculate the result.

{03} . {bf} = {10 XOR 01} . {1011 1111}

= {1011 1111 . 10} XOR {1011 1111 . 01}

= {1011 1111 . 10} XOR {1011 1111}

(Because {1011 1111} x 1[in decimal] = 1011 1111)

= 0111 1110 XOR 0001 1011 XOR 1011 1111

= 1101 1010 (ans)

{01.5d} and {01.30} simply multiply 5d and 30 by 1 (in decimal), which leaves the original values unchanged. There is no need to calculate them using the above method, but the values still need to be converted to binary form.

5d = 0101 1101

30 = 0011 0000

Now, add those values together. As the values are in binary form, addition is done using XOR.

r0 = {02.d4} + {03.bf} + {01.5d} + {01.30}

= 1011 0011 XOR 1101 1010 XOR 0101 1101 XOR 0011 0000

= 0000 0100

= 04 (in Hex)

Now for the next row.

r1 = {01.d4} + {02.bf} + {03.5d} + {01.30}

1. {02.bf}

{bf} . {02} = 1011 1111 << 1

= 0111 1110 XOR 0001 1011

= 0110 0101

2. {03.5d}

{5d} . {03} = {0101 1101. 02} XOR { 0101 1101}

= 1011 1010 XOR 0101 1101

= 1110 0111

Therefore,

r1 = {01.d4} + {02.bf} + {03.5d} + {01.30}

= 1101 0100 XOR 0110 0101 XOR 1110 0111 XOR 0011 0000

= 0110 0110

= 66 (in Hex)

This gives the second value, 66. Doing the same for the remaining two rows gives r2 = 81 and r3 = e5.
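The whole column calculation can be reproduced with a small Python sketch. It is an illustration only: the helper names xtime, mul3 and mix_column are hypothetical, and multiplication by 03 is written as (02 times the value) XOR the value, as in the hand calculation.

```python
def xtime(b):
    """Multiply a byte by {02} in GF(2^8)."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted


def mul3(b):
    """Multiply a byte by {03}: {02}.b XOR {01}.b."""
    return xtime(b) ^ b


def mix_column(a):
    """Apply the Mix-Columns matrix to one column [a0, a1, a2, a3]."""
    a0, a1, a2, a3 = a
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,   # row 02 03 01 01
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,   # row 01 02 03 01
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),   # row 01 01 02 03
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),   # row 03 01 01 02
    ]


column = mix_column([0xD4, 0xBF, 0x5D, 0x30])
print(" ".join(f"{b:02x}" for b in column))  # 04 66 81 e5
```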

2.6.4 Addition of Round Key Transformation

In the Addition of Round Key transformation AddRoundKey( ), a Round Key is added to the State by a simple bitwise XOR operation. Each Round Key consists of Nb words from the key schedule generation (described in the following Section 2.6.5). Those Nb words are each added into the columns of the State, such that

\[ [S'_{0,c}, S'_{1,c}, S'_{2,c}, S'_{3,c}] = [S_{0,c}, S_{1,c}, S_{2,c}, S_{3,c}] \oplus [w_{round \cdot Nb + c}] \quad \text{for } 0 \le c < Nb, \]

where [wi] are the key generation words described in chapter 3, and round is a value in the range 0 ≤ round ≤ Nr. In the Encryption, the initial Round Key addition occurs when round = 0, prior to the first application of the round function. The application of the AddRoundKey( ) transformation to the Nr rounds of the encryption occurs when 1 ≤ round ≤ Nr. The action of this transformation is illustrated in Figure 2.13, where l = round * Nb. The byte addressing within words of the key schedule was described in Section 1.2.1.

FIGURE 2.13. EXCLUSIVE-OR OPERATION OF STATE AND CIPHER KEY WORDS [1]
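As a rough sketch of the XOR described above, the following Python fragment adds Nb = 4 key schedule words to the columns of the State. The layout is an assumption made for illustration: the state is a list of four rows, and each key schedule word is a list of four bytes. The function name add_round_key and the sample values are hypothetical.

```python
def add_round_key(state, w, rnd, nb=4):
    """XOR column c of the state with key schedule word w[rnd * nb + c]."""
    l = rnd * nb  # index of the first word of this round key
    return [[state[r][c] ^ w[l + c][r] for c in range(nb)] for r in range(4)]


# Hypothetical sample data, not taken from a real key schedule.
state = [
    [0x00, 0x11, 0x22, 0x33],
    [0x44, 0x55, 0x66, 0x77],
    [0x88, 0x99, 0xAA, 0xBB],
    [0xCC, 0xDD, 0xEE, 0xFF],
]
w = [[0x0F] * 4, [0xF0] * 4, [0xFF] * 4, [0x00] * 4]

out = add_round_key(state, w, 0)
restored = add_round_key(out, w, 0)  # XOR with the same key undoes the step
```

Because XOR is its own inverse, applying the same round key twice restores the original State, which is why AddRoundKey appears unchanged in the decryption process.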

2.6.5 Key Schedule Generation:

Each round key is a 4-word (128-bit) array derived from the previous round key, a constant that changes each round, and a series of S-Box (figure 6) lookups for each 32-bit word of the key. The first round key is the same as the original user input. Each word (w0 - w3) of the initial key is XORed with a constant that depends on the current round, and with the result of the S-Box lookup for wi, to form the next round key. The number of rounds required for the three different key lengths is presented in Table 2.3.

            Key length    Block size    Number of
            (Nk words)    (Nb words)    rounds (Nr)
AES-128         4             4            10
AES-192         6             4            12
AES-256         8             4            14

TABLE 2.3. KEY-BLOCK-ROUND COMBINATIONS [1]

The Key schedule Expansion generates a total of Nb(Nr + 1) words: the

algorithm requires an initial set of Nb words, and each of the Nr rounds requires Nb

words of key data. The resulting key schedule consists of a linear array of 4-byte words,

denoted [wi], with i in the range 0 ≤ i < Nb(Nr + 1).

Prior to encryption or decryption the key must be expanded. The expanded key is used in the add round key function. Each time the add round key function is called, a different part of the expanded key is XORed against the state. For this to work, the expanded key must be large enough to provide key material every time the add round key function is executed. The add round key function is called once for each round, as well as one extra time at the beginning of the algorithm.

Therefore the size of the expanded key will always be equal to:

16 * (number of rounds + 1) bytes

The 16 in the above formula is the size of the block in bytes. This provides key material for every byte in the block during every round, plus one.

Since the key size is much smaller than the total size of the sub keys, the key is actually "stretched out" to provide enough key space for the algorithm.
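The expanded key size can be checked numerically. The sketch below assumes the size is the 16-byte block size multiplied by (number of rounds + 1), with the round counts taken from Table 2.3; the names BLOCK_BYTES, ROUNDS and expanded_key_bytes are hypothetical.

```python
BLOCK_BYTES = 16                    # AES block size in bytes
ROUNDS = {16: 10, 24: 12, 32: 14}   # key size in bytes -> number of rounds


def expanded_key_bytes(key_bytes):
    """One block of key material per round, plus one for the initial add round key."""
    return BLOCK_BYTES * (ROUNDS[key_bytes] + 1)


sizes = {k: expanded_key_bytes(k) for k in ROUNDS}
print(sizes)  # {16: 176, 24: 208, 32: 240}
```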

The key expansion routine executes a maximum of 4 consecutive functions. These functions are:

ROT WORD

SUB WORD

RCON

An iteration of the above steps is called a round. The number of rounds of the key expansion depends on the key size.

Key size   Block size   Expansion          Expanded        Rounds     Rounds key   Expanded key
(bytes)    (bytes)      algorithm rounds   bytes / round   key copy   expansion    (bytes)
16         16           44                 4               4          40           176
24         16           52                 4               6          46           208
32         16           60                 4               8          52           240

TABLE 2.4 REPRESENTATION OF AES-128, AES-192, AES-256 BIT BLOCK SIZE, EXPANSION ALGORITHM, ROUND KEY COPY, ROUND KEY

The first bytes of the expanded key are always equal to the key. If the key is 16 bytes long, the first 16 bytes of the expanded key will be the same as the original key. If the key size is 32 bytes, then the first 32 bytes of the expanded key will be the same as the original key.

Each round adds 4 bytes to the expanded key. With the exception of the first rounds, each round also takes the previous round's 4 bytes as input, operates on them, and returns 4 bytes.

One more important note is that not all of the 4 functions are called in each round. The algorithm only calls all 4 of the functions every

4 rounds for a 16-byte key

6 rounds for a 24-byte key

8 rounds for a 32-byte key

In the rest of the rounds, only a K function result is XORed with the result of the EK function. There is one exception to this rule: if the key is 32 bytes long, an additional call to the sub word function is made every 8 rounds, starting on the 13th round.
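The expansion loop described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in this project: each "round" of the text is treated as producing one 4-byte word, the S-Box is computed on the fly (multiplicative inverse followed by the affine transformation) instead of being read from the table, and all function names are hypothetical. The rotate, S-Box and Rcon steps it uses are the operations described in the text that follows.

```python
def xtime(b):
    """Multiply a byte by {02} in GF(2^8)."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def gmul(a, b):
    """Multiply two bytes in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def build_sbox():
    """Compute the S-Box: multiplicative inverse, then affine transformation."""
    sbox = []
    for value in range(256):
        # Brute-force inverse; {00} has no inverse and maps to itself.
        inv = next((y for y in range(1, 256) if gmul(value, y) == 1), 0)
        s = inv
        for n in range(1, 5):
            s ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def expand_key(key):
    """Return the key schedule as a list of 4-byte words."""
    nk = len(key) // 4              # key length in words: 4, 6 or 8
    nr = nk + 6                     # number of rounds: 10, 12 or 14
    words = [list(key[4 * i:4 * i + 4]) for i in range(nk)]  # key copy
    rcon = 0x01
    for i in range(nk, 4 * (nr + 1)):
        temp = list(words[i - 1])
        if i % nk == 0:
            temp = temp[1:] + temp[:1]         # Rot Word
            temp = [SBOX[b] for b in temp]     # Sub Word
            temp[0] ^= rcon                    # Rcon
            rcon = xtime(rcon)
        elif nk == 8 and i % nk == 4:
            temp = [SBOX[b] for b in temp]     # extra Sub Word for 32-byte keys
        words.append([x ^ y for x, y in zip(words[i - nk], temp)])
    return words


key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
schedule = expand_key(key)
print(len(schedule), bytes(schedule[4]).hex())
```

For the 16-byte key shown, the schedule has 44 words (176 bytes), the first four of which are a copy of the key.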

Rijndael's key schedule utilizes a number of operations, which are described first, before the key schedule itself.

ROTATE:

The rotate operation takes a 32-bit word like this (in hexadecimal):

1d2c3a4f

and rotates it eight bits to the left:

2c3a4f1d
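A minimal Python sketch of the rotate operation, assuming the word is held as a 32-bit integer (the function name rotate is chosen here for illustration):

```python
def rotate(word):
    """Rotate a 32-bit word eight bits to the left."""
    return ((word << 8) | (word >> 24)) & 0xFFFFFFFF


rotated = rotate(0x1D2C3A4F)
print(f"{rotated:08x}")  # 2c3a4f1d
```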

RCON:

Rcon is what the Rijndael documentation calls the exponentiation of 2 to a user-specified value. Note that this operation is not performed with regular integers, but in Rijndael's finite field. In polynomial form, 2 is

\[ 2 = 00000010 = 0x^7 + 0x^6 + 0x^5 + 0x^4 + 0x^3 + 0x^2 + 1x^1 + 0x^0 = x, \]

and we compute

\[ rcon(i) = x^{(254+i)} \]

in $F_{2^8}$ or, equivalently,

\[ rcon(i) = x^{(254+i)} \bmod (x^8 + x^4 + x^3 + x + 1) \]

in $F_2$.

For example, rcon(1) = 1, rcon(2) = 2, rcon(3) = 4, and rcon(9) is the hexadecimal number 0x1b (27 decimal).
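The Rcon values can be generated by repeated multiplication by x. The sketch below relies on the fact that x^255 = 1 in this field, so x^(254+i) equals x^(i-1); the function names are hypothetical.

```python
def xtime(b):
    """Multiply a byte by x (that is, by {02}) in GF(2^8)."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted


def rcon(i):
    """Return x^(i-1), which equals x^(254+i) because x^255 = 1."""
    value = 1
    for _ in range(i - 1):
        value = xtime(value)
    return value


table = [rcon(i) for i in range(1, 11)]
print(" ".join(f"{v:02X}" for v in table))  # 01 02 04 08 10 20 40 80 1B 36
```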

The Rcon table for the encryption process is given below.

01 02 04 08 10 20 40 80 1B 36
00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00

TABLE 2.5 RCON TABLE IN ENCRYPTION PROCESS

2.7. Decryption Process:

This process is the direct inverse of the encryption process. All the transformations applied in the encryption process are applied inversely in this process. Hence the last-round values of both the data and the key are the first-round inputs for the decryption process, and the rounds follow in decreasing order.

[Figure: Cipher text → AddRoundKey (1st round) → InvShiftRows, InvSubBytes, AddRoundKey, InvMixColumns (repeated for Nr-1 rounds) → InvShiftRows, InvSubBytes, AddRoundKey (last round) → Plain text. Each AddRoundKey takes a round key as input.]

FIGURE 2.14: BLOCK DIAGRAM OF DECRYPTION PROCESS

2.7.1. INVERSE BYTES SUBSTITUTION TRANSFORMATION:

Inverse Byte Substitution Transformation InvSubBytes( ) is the inverse of the byte substitution transformation, in which the inverse S-Box (Figure 2.16) is applied to each byte of the State. This is obtained by applying the inverse of the affine transformation to the equation (16) followed by taking the multiplicative inverse in GF(2^8).

FIGURE 2.15. APPLICATION OF THE INVERSE S-BOX TO EACH BYTE OF THE STATE

[1]

FIGURE 2.16. INVERSE S-BOX VALUES FOR ALL 256 COMBINATIONS IN HEXADECIMAL FORMAT
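The inverse S-Box values can also be derived computationally instead of being read from the table. The following Python sketch applies the inverse affine transformation and then the multiplicative inverse in GF(2^8), as described above. The rotation-based form of the affine maps and all function names are assumptions made for this illustration; the forward S-Box is included only to check that the two are inverses.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = ((a << 1) ^ 0x11B) & 0xFF if a & 0x80 else a << 1
        b >>= 1
    return result


def gf_inverse(a):
    """Multiplicative inverse as a^254; {00} is mapped to itself."""
    result = 1
    for _ in range(254):
        result = gmul(result, a)
    return result if a else 0


def rotl8(b, n):
    """Rotate a byte left by n bits."""
    return ((b << n) | (b >> (8 - n))) & 0xFF


def sub_byte(b):
    """Forward S-Box: inverse in GF(2^8), then affine transformation."""
    inv = gf_inverse(b)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


def inv_sub_byte(s):
    """Inverse S-Box: inverse affine transformation, then inverse in GF(2^8)."""
    b = rotl8(s, 1) ^ rotl8(s, 3) ^ rotl8(s, 6) ^ 0x05
    return gf_inverse(b)


INV_SBOX = [inv_sub_byte(s) for s in range(256)]
print(f"{INV_SBOX[0x00]:02x} {INV_SBOX[0x63]:02x}")  # 52 00
```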

2.7.2. INVERSE SHIFT ROWS TRANSFORMATION:

Inverse Shift Rows Transformation InvShiftRows( ) is the inverse of the ShiftRows( ) transformation presented in Section 2.6.2. The bytes in the last three rows of the State are cyclically shifted over different numbers of bytes. The first row, r = 0, is not shifted. The bottom three rows are cyclically shifted by Nb - shift(r, Nb) bytes, where the shift value shift(r, Nb) depends on the row number, as explained in Section 2.6.2.

Specifically, the InvShiftRows( ) transformation proceeds as follows:

\[ S'_{r,\,(c + shift(r,Nb)) \bmod Nb} = S_{r,c} \quad \text{for } 0 < r < 4 \text{ and } 0 \le c < Nb \]

FIGURE 2.17. INVERSE CYCLIC SHIFT OF THE LAST THREE ROWS OF THE STATE

[1]
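A short Python sketch of the inverse shift, assuming the same list-of-rows State layout as before (Nb = 4). The function names are hypothetical, and shift_rows is included only to show that the two operations cancel.

```python
def shift_rows(state):
    """Forward ShiftRows: rotate row r left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def inv_shift_rows(state):
    """InvShiftRows: rotate row r right by r positions."""
    # Python slicing needs r = 0 handled separately (row[-0:] is the whole row).
    return [row[-r:] + row[:-r] if r else list(row) for r, row in enumerate(state)]


state = [
    [0x00, 0x01, 0x02, 0x03],
    [0x10, 0x11, 0x12, 0x13],
    [0x20, 0x21, 0x22, 0x23],
    [0x30, 0x31, 0x32, 0x33],
]
unshifted = inv_shift_rows(state)
round_trip = inv_shift_rows(shift_rows(state))
```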

2.7.3. INVERSE MIXING OF COLUMNS TRANSFORMATION:

Inverse Mixing of Columns Transformation InvMixColumns( ) is the inverse of the MixColumns( ) transformation presented in Section 2.6.3. InvMixColumns( ) operates on the State column-by-column, treating each column as a four-term polynomial as described in Section 1.3.4. The columns are considered as polynomials over GF(2^8) and multiplied modulo x^4 + 1 with a fixed polynomial a^{-1}(x), given by

\[ a^{-1}(x) = \{0b\}x^3 + \{0d\}x^2 + \{09\}x + \{0e\}. \]

As described in Section 1.3.4, this can be written as a matrix multiplication. Let S'(x) = a^{-1}(x) ⊗ S(x):

\[
\begin{bmatrix} S'_{0,c} \\ S'_{1,c} \\ S'_{2,c} \\ S'_{3,c} \end{bmatrix}
=
\begin{bmatrix}
0e & 0b & 0d & 09 \\
09 & 0e & 0b & 0d \\
0d & 09 & 0e & 0b \\
0b & 0d & 09 & 0e
\end{bmatrix}
\begin{bmatrix} S_{0,c} \\ S_{1,c} \\ S_{2,c} \\ S_{3,c} \end{bmatrix}
\quad \text{for } 0 \le c < Nb.
\]

As a result of this multiplication, the four bytes in a column are replaced by the following equations.

\[
\begin{aligned}
S'_{0,c} &= (\{0e\} \cdot S_{0,c}) \oplus (\{0b\} \cdot S_{1,c}) \oplus (\{0d\} \cdot S_{2,c}) \oplus (\{09\} \cdot S_{3,c}) \\
S'_{1,c} &= (\{09\} \cdot S_{0,c}) \oplus (\{0e\} \cdot S_{1,c}) \oplus (\{0b\} \cdot S_{2,c}) \oplus (\{0d\} \cdot S_{3,c}) \\
S'_{2,c} &= (\{0d\} \cdot S_{0,c}) \oplus (\{09\} \cdot S_{1,c}) \oplus (\{0e\} \cdot S_{2,c}) \oplus (\{0b\} \cdot S_{3,c}) \\
S'_{3,c} &= (\{0b\} \cdot S_{0,c}) \oplus (\{0d\} \cdot S_{1,c}) \oplus (\{09\} \cdot S_{2,c}) \oplus (\{0e\} \cdot S_{3,c})
\end{aligned}
\]

Hence, State can be represented as

FIGURE 2.18. INVERSE MIX COLUMN OPERATION ON STATE

[1]

The calculation for the inverse mix-columns operation is worked through in detail below.

Understanding Of Calculations For Inverse Mix-Columns:

Multiplication of Bits

In the encryption process, the Mix-Columns calculation [2] used a shift of 1 bit to the left, followed by an XOR if the leftmost bit before the shift is 1. In Inverse Mix-Columns this idea still works, with one thing in mind: we are no longer multiplying by numbers like 2, 3 or 1, but rather by 14, 13, 11 and 9. As a general example of the method, the value to compute is [1]:

0101 0111×1000 0011

To find the answer, we have to do it in steps. Let's go through it step by step:

Step 1: Split 1000 0011 into smaller bits.

"Smaller bits" here means values with only a single 1 bit in the 8-bit value (e.g., 1000 0000). To get 1000 0011, the values are combined like this:

1000 0011 = 1000 0000 XOR 0000 0010 XOR 0000 0001

Step 2: Determine the results of multiplication with 0101 0111

This is the part that needs a little more care than usual. A single mistake here will make the rest of the calculation wrong. Let's start multiplying.

0101 0111 x 0000 0001 = 0101 0111

This is the same as multiplying anything by 1, since 0000 0001 = 1 in decimal.

0101 0111 x 0000 0010 = 0101 0111 << 1

= 1010 1110

Remember this is a 1-bit shift to the left, appending a 0 at the end.

0101 0111 x 0000 0100 = (0101 0111 x 0000 0010) << 1 XOR 0001 1011

= (1010 1110 << 1) XOR 0001 1011

= 0101 1100 XOR 0001 1011

= 0100 0111

Notice that there is an XOR in this calculation. This is because the leftmost bit of the value being shifted (1010 1110) is 1. In the previous calculation there was only a shift, as the value being shifted (0101 0111) has 0 as its leftmost bit. This is the conditional XOR that accompanies the shift in our calculation. Let's continue.

0101 0111 x 0000 1000 = (0101 0111 x 0000 0100) << 1

= 1000 1110

0101 0111 x 0001 0000 = (0101 0111 x 0000 1000) << 1 XOR 0001 1011

= 0001 1100 XOR 0001 1011

= 0000 0111

0101 0111 x 0010 0000 = (0101 0111 x 0001 0000) << 1

= 0000 1110

0101 0111 x 0100 0000 = (0101 0111 x 0010 0000) << 1

= 0001 1100

0101 0111 x 1000 0000 = (0101 0111 x 0100 0000) << 1

= 0011 1000

Once we are done with this, we are ready to proceed on.

Step 3: Get the final value, 0101 0111×1000 0011

In step 1 we have already split 1000 0011 into smaller parts. We can now use that in

our calculation

0101 0111 x 1000 0011 = 0101 0111 x (1000 0000 XOR 0000 0010 XOR 00000001)

= 0101 0111 x 1000 0000 XOR 0101 0111 x 0000 0010 XOR

0101 0111 x 0000 0001

= 0011 1000 XOR 1010 1110 XOR 0101 0111

= 1100 0001 (Ans)

And that's how we get the answer.
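The three steps above generalize to any pair of bytes. The Python sketch below splits the second operand into single-bit values, builds the successive products by shifting with the conditional XOR, and XORs together the ones that are needed. The function names xtime and gmul are assumptions used for illustration.

```python
def xtime(b):
    """Multiply a byte by 0000 0010 in GF(2^8)."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted


def gmul(a, b):
    """Multiply a by b in GF(2^8) by splitting b into single-bit values."""
    result = 0
    power = a                    # a x 0000 0001
    for bit in range(8):
        if b & (1 << bit):       # this single-bit value is part of b
            result ^= power
        power = xtime(power)     # a x the next single-bit value
    return result


product = gmul(0b01010111, 0b10000011)
print(f"{product:08b}")  # 11000001
```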

AES Inverse Mix Column Calculation Example:

This example follows the same method as the AES Mix-Columns transformation calculation above [2], except that this time it is the exact opposite operation. In the inverse mix column transformation, our 4x4 matrix is no longer

02 03 01 01

01 02 03 01

01 01 02 03

03 01 01 02

In the inverse mix column transformation, we will be using this matrix instead:

0E 0B 0D 09

09 0E 0B 0D

0D 09 0E 0B

0B 0D 09 0E

Therefore our formula will be:

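\[
\begin{bmatrix}
0E & 0B & 0D & 09 \\
09 & 0E & 0B & 0D \\
0D & 09 & 0E & 0B \\
0B & 0D & 09 & 0E
\end{bmatrix}
\begin{bmatrix} 04 \\ 66 \\ 81 \\ E5 \end{bmatrix}
=
\begin{bmatrix} D4 \\ BF \\ 5D \\ 30 \end{bmatrix}
\]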

Remember that matrix multiplication is always row x column. Therefore, we have the following results:

1. {0E.04} + {0B.66} + {0D.81} + {09.E5} = D4

2. {09.04} + {0E.66} + {0B.81} + {0D.E5} = BF

3. {0D.04} + {09.66} + {0E.81} + {0B.E5} = 5D

4. {0B.04} + {0D.66} + {09.81} + {0E.E5} = 30

We compute the first two formulas here; the last two are left as an exercise. Let's start with (1).

1. {0E.04} + {0B.66} + {0D.81} + {09.E5} = D4

As in the Mix-Columns transformation, we will work in parts to arrive at our answers.

➢ 0E.04

First of all, convert the hexadecimal values to binary.

0E = 0000 1110

04 = 0000 0100

We will now use the same method as in the multiplication of bits above. For ease of computation, choose 0E as the split value.

0E = 0000 1110 = 0000 1000 XOR 0000 0100 XOR 0000 0010

0000 0100 x 0000 0001 = 0000 0100

0000 0100 x 0000 0010 = 0000 1000

0000 0100 x 0000 0100 = 0001 0000

0000 0100 x 0000 1000 = 0010 0000

Therefore, we can now compute 0E.04. Notice that we did not continue on to the next few values. It is not necessary to compute values that are not used, which saves some time in an exam.

0E.04 = {0000 0100 x 0000 1000} XOR {0000 0100 x 0000 0100}

XOR {0000 0100 x 0000 0010}

= 0010 0000 XOR 0001 0000 XOR 0000 1000

= 0011 1000

Do the same for the rest.

➢ 0B.66

0B = 0000 1011

66 = 0110 0110

0B = 0000 1000 XOR 0000 0010 XOR 0000 0001

0110 0110 x 0000 0001 = 0110 0110

0110 0110 x 0000 0010 = 1100 1100

0110 0110 x 0000 0100 = 1001 1000 XOR 0001 1011 = 1000 0011

0110 0110 x 0000 1000 = 0000 0110 XOR 0001 1011 = 0001 1101

0B.66 = {0000 1000 x 0110 0110} XOR {0000 0010 x 0110 0110}

XOR {0000 0001 x 0110 0110}

= 0001 1101 XOR 1100 1100 XOR 0110 0110

= 1011 0111

➢ 0D.81

0D = 0000 1101 = 0000 1000 XOR 0000 0100 XOR 0000 0001

81 = 1000 0001

1000 0001 x 0000 0001 = 1000 0001

1000 0001 x 0000 0010 = 0000 0010 XOR 0001 1011 = 0001 1001

1000 0001 x 0000 0100 = 0011 0010

1000 0001 x 0000 1000 = 0110 0100

0D.81 = {0000 1000 x 1000 0001 } XOR {0000 0100 x 1000 0001}

XOR {0000 0001 x 1000 0001}

= 0110 0100 XOR 0011 0010 XOR 1000 0001

= 1101 0111

➢ 09.E5

09 = 0000 1001 = 0000 1000 XOR 0000 0001

E5 = 1110 0101

1110 0101 x 0000 0001 = 1110 0101

1110 0101 x 0000 0010 = 1100 1010 XOR 0001 1011 = 1101 0001

1110 0101 x 0000 0100 = 1010 0010 XOR 0001 1011 = 1011 1001

1110 0101 x 0000 1000 = 0111 0010 XOR 0001 1011 = 0110 1001

09.E5 = {0000 1000 x 1110 0101} XOR {0000 0001 x 1110 0101}

= {0110 1001} XOR 1110 0101

= 1000 1100

Thus,

{0E.04} + {0B.66} + {0D.81} + {09.E5} = 0011 1000 XOR 1011 0111 XOR 1101 0111

XOR 1000 1100

= 1101 0100

= D4 (Shown)

2. {09.04} + {0E.66} + {0B.81} + {0D.E5} = BF

Do the exact same thing as the above.

➢ 09.04

09 = 0000 1001 = 0000 1000 XOR 0000 0001

04 = 0000 0100

0000 0100 x 0000 0001 = 0000 0100

0000 0100 x 0000 0010 = 0000 1000

0000 0100 x 0000 0100 = 0001 0000

0000 0100 x 0000 1000 = 0010 0000

09.04 = {0000 0100 x 0000 1000} XOR {0000 0100 x 0000 0001}

= 0010 0000 XOR 0000 0100

= 0010 0100

➢ 0E.66

0E = 0000 1110 = 0000 1000 XOR 0000 0100 XOR 0000 0010

66 = 0110 0110

0110 0110 x 0000 0001 = 0110 0110

0110 0110 x 0000 0010 = 1100 1100

0110 0110 x 0000 0100 = 1001 1000 XOR 0001 1011 = 1000 0011

0110 0110 x 0000 1000 = 0000 0110 XOR 0001 1011 = 0001 1101

0E.66 = {0110 0110 x 0000 1000} XOR {0110 0110 x 0000 0100}

XOR {0110 0110 x 0000 0010}

= 0001 1101 XOR 1000 0011 XOR 1100 1100

= 0101 0010

➢ 0B.81

0B = 0000 1011 = 0000 1000 XOR 0000 0010 XOR 0000 0001

81 = 1000 0001

1000 0001 x 0000 0001 = 1000 0001

1000 0001 x 0000 0010 = 0000 0010 XOR 0001 1011 = 0001 1001

1000 0001 x 0000 0100 = 0011 0010

1000 0001 x 0000 1000 = 0110 0100

0B.81 = {1000 0001 x 0000 1000} XOR {1000 0001 x 0000 0010}

XOR {1000 0001 x 0000 0001}

= 0110 0100 XOR 0001 1001 XOR 1000 0001

= 1111 1100

➢ 0D.E5

0D = 0000 1101 = 0000 1000 XOR 0000 0100 XOR 0000 0001

E5 = 1110 0101

1110 0101 x 0000 0001 = 1110 0101

1110 0101 x 0000 0010 = 1100 1010 XOR 0001 1011 = 1101 0001

1110 0101 x 0000 0100 = 1010 0010 XOR 0001 1011 = 1011 1001

1110 0101 x 0000 1000 = 0111 0010 XOR 0001 1011 = 0110 1001

0D.E5 = {1110 0101 x 0000 1000} XOR {1110 0101 x 0000 0100}

XOR {1110 0101 x 0000 0001}

= 0110 1001 XOR 1011 1001 XOR 1110 0101

= 0011 0101

Thus,

{09.04} + {0E.66} + {0B.81} + {0D.E5} = 0010 0100 XOR 0101 0010 XOR 1111 1100

XOR 0011 0101

= 1011 1111

= BF (Shown)

That is all there is to it. Doing the same for the remaining two rows gives the values 5D and 30.
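The complete inverse column calculation can be verified with a short Python sketch. The helper names gmul and inv_mix_column are hypothetical; the matrix is the inverse mix column matrix given above.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) using shift and conditional XOR."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return result


INV_MATRIX = [
    [0x0E, 0x0B, 0x0D, 0x09],
    [0x09, 0x0E, 0x0B, 0x0D],
    [0x0D, 0x09, 0x0E, 0x0B],
    [0x0B, 0x0D, 0x09, 0x0E],
]


def inv_mix_column(column):
    """Multiply the inverse matrix by one column, row x column, adding with XOR."""
    result = []
    for row in INV_MATRIX:
        acc = 0
        for coeff, byte in zip(row, column):
            acc ^= gmul(coeff, byte)
        result.append(acc)
    return result


recovered = inv_mix_column([0x04, 0x66, 0x81, 0xE5])
print(" ".join(f"{b:02X}" for b in recovered))  # D4 BF 5D 30
```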

[Figure: encryption path: Plain text → Add round key → Rounds 1 to 9 (Sub bytes, Shift rows, Mix columns, Add round key) → Round 10 (Sub bytes, Shift rows, Add round key) → Cipher text. Decryption path: Cipher text → Add round key → Rounds 1 to 9 (Inverse shift rows, Inverse sub bytes, Add round key, Inverse mix columns) → Round 10 (Inverse shift rows, Inverse sub bytes, Add round key) → Plain text. The Expand key block turns the key into the round key words w[0,3], w[4,7], ..., w[36,39], w[40,43] used by the Add round key steps.]

FIGURE 2.19 OVERALL STRUCTURE OF AES [5]

2.8 FPGA INTRODUCTION:

A field-programmable gate array (FPGA) is a semiconductor device that can be configured by the customer or designer after manufacturing, hence the name "field-programmable". To program an FPGA you specify how you want the chip to work with a logic circuit diagram or source code in a hardware description language (HDL).

FPGAs can be used to implement any logical function that an application-specific

integrated circuit (ASIC) could perform, but the ability to update the functionality after

shipping offers advantages for many applications.

FPGAs contain programmable logic components called "logic blocks", and a

hierarchy of reconfigurable interconnects that allow the blocks to be "wired together"—

somewhat like a one-chip programmable breadboard. Logic blocks can be configured to

perform complex combinational functions, or merely simple logic gates like AND and

XOR. In most FPGAs, the logic blocks also include memory elements, which may be

simple flip-flops or more complete blocks of memory.

For any given semiconductor process, FPGAs are usually slower than their fixed

ASIC counterparts. They also draw more power, and generally achieve less functionality

using a given amount of circuit complexity. But their advantages include a shorter time

to market, ability to re-program in the field to fix bugs, and lower non-recurring

engineering costs. Vendors can also take a middle road by developing their hardware

on ordinary FPGAs, but manufacture their final version so it can no longer be modified

after the design has been committed.

The historical roots of FPGAs are in complex programmable logic devices

(CPLDs) of the early to mid 1980s. A Xilinx co-founder, Ross Freeman, invented the

field programmable gate array in 1984. CPLDs and FPGAs include a relatively large

number of programmable logic elements. CPLD logic gate densities range from the

equivalent of several thousand to tens of thousands of logic gates, while FPGAs

typically range from tens of thousands to several million.

The primary differences between CPLDs and FPGAs are architectural. A CPLD

has a somewhat restrictive structure consisting of one or more programmable sum-of-

products logic arrays feeding a relatively small number of clocked registers. The result

of this is less flexibility, with the advantage of more predictable timing delays and a

higher logic-to-interconnect ratio. The FPGA architectures, on the other hand, are

dominated by interconnect. This makes them far more flexible (in terms of the range of

designs that are practical for implementation within them) but also far more complex to

design for.

Another notable difference between CPLDs and FPGAs is the presence in most FPGAs of higher-level embedded functions (such as adders and multipliers) and embedded memories, as well as logic blocks that implement decoders or mathematical functions.


Some FPGAs have the capability of partial re-configuration that lets one portion

of the device be re-programmed while other portions continue running.

A recent trend has been to take the coarse-grained architectural approach a step

further by combining the logic blocks and interconnects of traditional FPGAs with

embedded microprocessors and related peripherals to form a complete "system on a

programmable chip". This work mirrors the architecture by Ron Perlof and Hana Potash

of Burroughs Advanced Systems Group which combined a reconfigurable CPU

architecture on a single chip called the SB24. That work was done in 1982. Examples of

such hybrid technologies can be found in the Xilinx Virtex-II PRO and Virtex-4 devices,

which include one or more PowerPC processors embedded within the FPGA's logic

fabric. The Atmel FPSLIC is another such device, which uses an AVR processor in

combination with Atmel's programmable logic architecture.

An alternate approach to using hard-macro processors is to make use of "soft" processor cores that are implemented within the FPGA logic.

As previously mentioned, many modern FPGAs have the ability to be reprogrammed at

"run time," and this is leading to the idea of reconfigurable computing or reconfigurable

systems — CPUs that reconfigure themselves to suit the task at hand. The Mitrion

Virtual Processor from Mitrionics is an example of a reconfigurable soft processor that is

implemented on FPGAs. It does not however support dynamic reconfiguration at

runtime, but instead adapts itself to a specific program.

Additionally, new, non-FPGA architectures are beginning to emerge. Software-

configurable microprocessors such as the Stretch S5000 adopt a hybrid approach by

providing an array of processor cores and FPGA-like programmable cores on the same

chip.

Applications of FPGAs include digital signal processing, software-defined radio,

aerospace and defense systems, ASIC prototyping, medical imaging, computer vision,

speech recognition, cryptography, bioinformatics, computer hardware emulation and a

growing range of other areas. FPGAs originally began as competitors to CPLDs and

competed in a similar space, that of glue logic for PCBs. As their size, capabilities, and

speed increased, they began to take over larger and larger functions to the state where

some are now marketed as full systems on chips (SOC).

FPGAs especially find applications in any area or algorithm that can make use of

the massive parallelism offered by their architecture. One such area is code breaking, in

particular brute-force attack, of cryptographic algorithms.

FPGAs are increasingly used in conventional High Performance Computing

applications where computational kernels such as FFT or Convolution are performed on

the FPGA instead of a microprocessor. The use of FPGAs for computing tasks is known

as reconfigurable computing.


The inherent parallelism of the logic resources on the FPGA allows for

considerable compute throughput even at a sub-500 MHz clock rate. For example, the

current (2007) generation of FPGAs can implement around 100 single precision floating

point units, all of which can compute a result every single clock cycle. The flexibility of

the FPGA allows for even higher performance by trading off precision and range in the

number format for an increased number of parallel arithmetic units. This has driven a

new type of processing called reconfigurable computing, where time intensive tasks are

offloaded from software to FPGAs.

The adoption of FPGAs in high performance computing is currently limited by the complexity of FPGA design compared to conventional software and the extremely long turn-around times of current design tools, where a wait of 4 to 8 hours is necessary after even minor changes to the source code.

2.8.1 FPGA ARCHITECTURE:

The typical basic architecture consists of an array of configurable logic blocks

(CLBs) and routing channels. Multiple I/O pads may fit into the height of one row or the

width of one column in the array. Generally, all the routing channels have the same

width (number of wires).

An application circuit must be mapped into an FPGA with adequate resources.

A classic FPGA logic block consists of a 4-input lookup table (LUT), and a flip-

flop, as shown below. In recent years, manufacturers have started moving to 6-input

LUTs in their high performance parts, claiming increased performance.

FIGURE 2.20 TYPICAL FPGA LOGIC BLOCK

There is only one output, which can be either the registered or the unregistered LUT

output. The logic block has four inputs for the LUT and a clock input. Since clock signals

(and often other high-fanout signals) are normally routed via special-purpose dedicated

routing networks in commercial FPGAs, they and other signals are separately managed.
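The behaviour of such a logic block can be modelled in software. The Python sketch below is only a behavioural illustration, not a description of any vendor's hardware: the class name LogicBlock, the bit ordering of the LUT inputs and the way the registered or unregistered output is selected are all assumptions.

```python
class LogicBlock:
    """A 4-input LUT followed by a flip-flop, with one selectable output."""

    def __init__(self, truth_table, registered=False):
        if len(truth_table) != 16:
            raise ValueError("a 4-input LUT needs 16 entries")
        self.truth_table = list(truth_table)  # configuration bits of the LUT
        self.registered = registered          # choose the flip-flop or the LUT output
        self.flip_flop = 0                    # state held between clock edges

    def lut(self, a, b, c, d):
        """Look up the configured function of the four inputs."""
        index = (d << 3) | (c << 2) | (b << 1) | a
        return self.truth_table[index]

    def clock(self, a, b, c, d):
        """Clock edge: the flip-flop captures the current LUT output."""
        self.flip_flop = self.lut(a, b, c, d)

    def output(self, a, b, c, d):
        """The single output: registered or unregistered LUT value."""
        return self.flip_flop if self.registered else self.lut(a, b, c, d)


# Configure the LUT as a 4-input XOR (parity of the input bits).
xor4 = [bin(i).count("1") & 1 for i in range(16)]
combinational = LogicBlock(xor4)
registered = LogicBlock(xor4, registered=True)
```

The same 16 configuration bits can implement any function of four inputs, which is what makes the LUT the basic programmable element.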


For this example architecture, the locations of the FPGA logic block pins are shown

below.

FIGURE 2.21 FPGA LOGIC BLOCK PIN LOCATION

Each input is accessible from one side of the logic block, while the output pin can

connect to routing wires in both the channel to the right and the channel below the logic

block. Each logic block output pin can connect to any of the wiring segments in the

channels adjacent to it.

Similarly, an I/O pad can connect to any one of the wiring segments in the

channel adjacent to it. For example, an I/O pad at the top of the chip can connect to any

of the W wires (where W is the channel width) in the horizontal channel immediately

below it.

Generally, the FPGA routing is unsegmented. That is, each wiring segment

spans only one logic block before it terminates in a switch box. By turning on some of

the programmable switches within a switch box, longer paths can be constructed. For

higher speed interconnect, some FPGA architectures use longer routing lines that span

multiple logic blocks.

Whenever a vertical and a horizontal channel intersect, there is a switch box. In

this architecture, when a wire enters a switch box, there are three programmable

switches that allow it to connect to three other wires in adjacent channel segments. The

pattern, or topology, of switches used in this architecture is the planar or domain-based

switch box topology. In this switch box topology, a wire in track number one connects

only to wires in track number one in adjacent channel segments, wires in track number

2 connect only to other wires in track number 2 and so on. The figure below illustrates

the connections in a switch box.
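
The planar topology can be expressed as a small connectivity rule. This is a sketch only: the side names and the function `planar_switch_box` are hypothetical, chosen to illustrate that a wire entering on track t reaches exactly three other wires, all on track t.

```python
def planar_switch_box(side, track):
    """Return the (side, track) wire ends reachable from one wire end.

    In the planar (domain-based) topology a wire on track t can only
    reach track t on the other three sides of the switch box.
    """
    sides = ("north", "east", "south", "west")
    if side not in sides:
        raise ValueError(f"unknown side: {side}")
    return [(other, track) for other in sides if other != side]
```

For example, a wire arriving from the west on track 2 can be switched to track 2 on the north, east or south side, and to no other track.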

Modern FPGA families expand upon the above capabilities to include higher level

functionality fixed into the silicon. Having these common functions embedded into the

silicon reduces the area required and gives those functions increased speed compared

to building them from primitives. Examples of these include multipliers, generic DSP

blocks, embedded processors, high speed IO logic and embedded memories.


FIGURE 2.22 FPGA SWITCH BOX TOPOLOGY

FPGAs are also widely used for systems validation including pre-silicon

validation, post-silicon validation, and firmware development. This allows chip

companies to validate their design before the chip is produced in the factory, reducing

the time to market.

2.8.2 FPGA DESIGN AND PROGRAMMING:

To define the behaviour of the FPGA, the user provides a hardware description

language (HDL) or a schematic design. The HDL form might be easier to work with

when handling large structures because it's possible to just specify them numerically

rather than having to draw every piece by hand. On the other hand, schematic entry can

allow for easier visualisation of a design.

Then, using an electronic design automation tool, a technology-mapped net list is

generated. The net list can then be fitted to the actual FPGA architecture using a

process called place-and-route, usually performed by the FPGA company's proprietary

place-and-route software. The user will validate the map, place and route results via

timing analysis, simulation, and other verification methodologies. Once the design and

validation process is complete, the binary file generated (also using the FPGA

company's proprietary software) is used to (re)configure the FPGA.

Going from schematic/HDL source files to actual configuration: The source files

are fed to a software suite from the FPGA/CPLD vendor that through different steps will

produce a file. This file is then transferred to the FPGA/CPLD via a serial interface

(JTAG) or to an external memory device like an EEPROM.

The most common HDLs are VHDL and Verilog, although in an attempt to reduce

the complexity of designing in HDLs, which have been compared to the equivalent of




assembly languages, there are moves to raise the abstraction level through the

introduction of alternative languages.

To simplify the design of complex systems in FPGAs, there exist libraries of predefined

complex functions and circuits that have been tested and optimized to speed up the

design process. These predefined circuits are commonly called IP cores, and are

available from FPGA vendors and third-party IP suppliers (rarely free, and typically

released under proprietary licenses). Other predefined circuits are available from

developer communities such as Open Cores (typically free, and released under the

GPL, BSD or similar license), and other sources.

In a typical design flow, an FPGA application developer will simulate the design

at multiple stages throughout the design process. Initially the RTL description in VHDL

or Verilog is simulated by creating test benches to simulate the system and observe

results. Then, after the synthesis engine has mapped the design to a net list, the net list

is translated to a gate level description where simulation is repeated to confirm the

synthesis proceeded without errors. Finally the design is laid out in the FPGA at which

point propagation delays can be added and the simulation run again with these values

back-annotated onto the net list.

Field-programmable gate arrays (FPGAs) arrived in 1984 as an alternative to

programmable logic devices (PLDs) and ASICs. As their name implies, FPGAs offer the

significant benefit of being readily programmable. Unlike their forebears in the PLD

category, FPGAs can (in most cases) be programmed again and again, giving

designers multiple opportunities to tweak their circuits.

There's no large non-recurring engineering (NRE) cost associated with FPGAs.

In addition, lengthy, nerve wracking waits for mask-making operations are squashed.

Often, with FPGA development, logic design begins to resemble software design due to

the many iterations of a given design. Innovative design often happens with FPGAs as

an implementation platform.

But there are some downsides to FPGAs as well. The economics of FPGAs force

designers to balance their relatively high piece-part pricing compared to ASICs with the

absence of high NREs and long development cycles. They're also available only in fixed

sizes, which matters when you're determined to avoid unused silicon area.


TABLE 2.6 DO’S AND DON’TS FOR THE FPGA DESIGNER

FPGAs fill a gap between discrete logic and the smaller PLDs on the low end of

the complexity scale and costly custom ASICs on the high end. They consist of an array

of logic blocks that are configured using software. Programmable I/O blocks surround

these logic blocks. Both are connected by programmable interconnects.

The programming technology in an FPGA determines the type of basic logic cell

and the interconnect scheme. In turn, the logic cells and interconnection scheme

determine the design of the input and output circuits as well as the programming

scheme. Just a few years ago, the largest FPGA was measured in tens of thousands of

system gates and operated at 40 MHz. Older FPGAs often cost more than $150 for the

most advanced parts at the time. Today, however, FPGAs offer millions of gates of logic

capacity, operate at 300 MHz, can cost less than $10, and offer integrated functions

like processors and memory.

2.8.3 FPGA USAGE THEN AND NOW:

Fifteen years ago, FPGAs were designed into systems primarily to reduce system

component costs by consolidating board-level logic into fewer devices. A few hundred

logic gates were replaced by a single FPGA that implemented the same functionality.

Availability of tools was a major bottleneck then.


The complexity of today's FPGAs allows system architects to replace a broad

range of ASICs with FPGAs and further consolidate and integrate the system logic into

fewer and fewer components. Today's FPGAs provide unprecedented flexibility at

attractive costs.

FIGURE 2.23 FPGA DEVICE COMPLEXITY

The following advantages make FPGAs a very attractive alternative to ASICs:

• No NRE costs.

• Easy design modification.

• In-system re-programmability.

• Easy off-the-shelf availability in small volumes.

• Fast time to market.

In the earlier days of FPGAs, users were primarily board design engineers and

their applications were limited to using FPGAs to integrate and consolidate 100s of

gates of board-level logic. To do so, they used their time-tested board design

methodology anchored by schematic capture. They would do the logic minimization

manually and capture the design at the Boolean logic level.

Hence the only additional tool needed was a fitter (FPGA place and route tool) to

map the logic design into the FPGA with correct routing. The fitter then generated a

chip-programming file. Verification often consisted of actual prototyping.


Both new FPGA designers and experienced ASIC designers are moving to fully exploit the new

generation of FPGA devices by replacing more ASICs with FPGAs. As FPGAs replace ASICs, FPGA

design is moving from board engineers into chip design teams. Adopting HDL design

methods helps meet their tight time-to-market requirements while implementing designs

in ever larger and more complex FPGAs. The combination of these FPGA technology

and design trends increases the need for FPGA design solutions that provide tools

powerful enough to handle ASIC designs while also delivering the productivity of an

integrated FPGA design flow.

Fifteen years later, the thought of designing a million-gate FPGA using a schematic

design methodology defies rational thought.

Today's FPGA designers are adopting HDL-based design methods at

astonishing rates. HDL-based design has increased productivity by allowing the

designer to work at a higher level of abstraction: the Register-Transfer Level instead of

the gate level.

Central to the shift to HDL-based design, coupled with the increased size of FPGAs, are

two strategically important tools:

Simulation for design verification.

Synthesis for automatic implementation of RTL design to the Gate-level.

Breadboard prototyping has fallen apart as a practical design verification method,

due to the cost of debugging functionality after layout.

Simulation allows design problems to be discovered earlier when it is more cost-

effective to fix them.

Schematics and block diagrams still have a limited role in FPGA design. Their

role has been limited to manual implementation of tightly constrained functional blocks

or to help manage complexity by graphically partitioning the design into smaller blocks.

FPGA Also known as:

LCA (Logic Cell Array)

pASIC (programmable ASIC)

FLEX, APEX (Altera)

ACT (Actel)

ORCA (Lucent)

Virtex (Xilinx)

pASIC (QuickLogic)


A generic description of an FPGA is a programmable device with an internal

array of logic blocks, surrounded by a ring of programmable input/output blocks,

connected together via programmable interconnect. There are a wide variety of sub-

architectures within this group. The secret to density and performance in these devices

lies in the logic contained in their logic blocks and on the performance and efficiency of

their routing architecture.

FPGAs are distinct from SPLDs and CPLDs and typically offer the highest logic

capacity. A typical FPGA contains from 64 to tens of thousands of logic blocks and an

even greater number of flip-flops. Most FPGAs do not provide 100% interconnect

between logic blocks, because doing so would be prohibitively expensive in terms of area.

FIGURE 2.24 ARCHITECTURE OF FPGA


CHAPTER 3

IMPLEMENTATION

This chapter discusses the method used to implement the Advanced Encryption Standard

(Rijndael algorithm). The entire implementation uses a 128-bit key.

3.1 AES ENCRYPTION PROCESS:

3.1.1 ENCRYPTION IMPLEMENTATION:

VHDL is used as the hardware description language because of the flexibility to

exchange among environments. The code is pure VHDL that could easily be

implemented on other devices, without changing the design. The software used for this

work is Xilinx ISE 8.1i. This is used for writing, debugging and optimizing efforts, and

also for fitting, simulating and checking the performance results using the simulation

tools available on Xilinx ISE design software.

3.1.2 STEPS FOLLOWED IN ENCRYPTION PROCESS:

FIGURE 3.1: FLOW CHART REPRESENTATION FOR AES ENCRYPTION PROCESS

[Flow chart. Blocks: Key Schedule; AddRoundKey(state, roundkey); i = Nr; round loop of SubBytes(state), ShiftRows(state), MixColumns(state), AddRoundKey(state, roundkey), i = i + 1; decision i > Nr (yes/no); final round of SubBytes(state), ShiftRows(state), AddRoundKey(state, roundkey).]

3.1.3. ADD ROUND KEY:

Add round key is an XOR between the state and the round key. This

transformation is its own inverse.

AES operation – add round key

Each byte of the round key is XORed with the corresponding byte in the state

table. The inverse operation is identical, since applying the XOR a second time returns

the original values.

    # XOR each byte of the round key with the state table (in place)
    def addroundkey(state, roundkey):
        for i in range(len(state)):
            state[i] ^= roundkey[i]

3.1.4. SUB BYTES:

Sub bytes substitutes each byte in the block independently of its position in the

state, using a substitution table (S-box). This is the non-linear transformation of the

cipher. The S-box used is proven to be optimal with regard to non-linearity, and it is

based on arithmetic in GF(2^8).

AES operation – sub bytes

Each byte of the state table is substituted with the value in the S-box whose

index is the value of the state table byte. This provides non-linearity (the algorithm is

not equal to the sum of its parts).
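
The S-box can be generated rather than typed in as a table. The sketch below assumes the standard Rijndael construction (multiplicative inverse in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1, followed by the affine transformation with constant 0x63); the helper names `xtime`, `gmul`, `build_sbox` and `sub_bytes` are illustrative and are not taken from the VHDL design.

```python
def xtime(b):
    """Multiply a byte by x (that is, by 2) in GF(2^8)."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def gmul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def build_sbox():
    """Build the 256-entry S-box: field inverse, then affine transform."""
    sbox = []
    for value in range(256):
        # brute-force multiplicative inverse; 0 has none and maps to 0
        inv = next((c for c in range(1, 256) if gmul(value, c) == 1), 0)
        out = inv
        for shift in range(1, 5):
            # XOR in the inverse rotated left by 1, 2, 3 and 4 bits
            out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(out ^ 0x63)
    return sbox


SBOX = build_sbox()


def sub_bytes(state):
    """Replace every state byte by the S-box entry it indexes."""
    return [SBOX[b] for b in state]
```

A hardware design would normally store the finished table in a ROM or LUTs; the brute-force inverse here is only meant to make the construction visible.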

3.1.5. SHIFT ROWS:

Shift rows is a cyclic shift of the bytes in the rows in the state and is clearly

invertible (by a shift in the opposite direction by the same amount).

AES operation –shift rows

Each row in the state table is shifted left by the number of bytes represented by

the row number.
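
As a sketch of the row shift and its inverse, assume the state is held as a flat list of 16 bytes in AES column-major order (index = row + 4 * column); that layout and the function names are assumptions made for illustration.

```python
def shift_rows(state):
    """Cyclically shift row r of the state left by r bytes."""
    out = [0] * 16
    for row in range(4):
        for col in range(4):
            out[row + 4 * col] = state[row + 4 * ((col + row) % 4)]
    return out


def inv_shift_rows(state):
    """Undo shift_rows: shift row r right by r bytes."""
    out = [0] * 16
    for row in range(4):
        for col in range(4):
            out[row + 4 * ((col + row) % 4)] = state[row + 4 * col]
    return out
```

Row 0 is left untouched, and applying `inv_shift_rows` after `shift_rows` returns the original state.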

3.1.6. MIX COLUMNS:

The linear mixing layer (shift rows and mix columns) guarantees high diffusion, while

the non-linear S-boxes protect against linear and differential cryptanalysis.

AES operation – mix columns

Mix columns is performed by multiplying each column of the state (within the Galois

finite field GF(2^8)) by the fixed matrix below (entries in hexadecimal):

02 03 01 01
01 02 03 01
01 01 02 03
03 01 01 02
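
The column multiplication can be sketched as follows, assuming the standard AES mixing matrix with rows (2, 3, 1, 1), (1, 2, 3, 1), (1, 1, 2, 3), (3, 1, 1, 2) and the reduction polynomial x^8 + x^4 + x^3 + x + 1. The names `xtime`, `gmul` and `mix_column` are illustrative.

```python
def xtime(b):
    """Multiply a byte by x (that is, by 2) in GF(2^8)."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def gmul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


MIX = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]


def mix_column(col):
    """Multiply one 4-byte state column by the fixed matrix in GF(2^8)."""
    out = []
    for row in MIX:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gmul(coeff, byte)   # addition in GF(2^8) is XOR
        out.append(acc)
    return out
```

Because the coefficients are only 1, 2 and 3, a hardware version needs just `xtime` and XOR gates per byte.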


3.1.7. KEY EXPANSION:

AES-expansion operations

AES key expansion consists of several primitive operations:

Rotate – takes a 4-byte word and rotates everything one byte to the left,

e.g. rotate([1,2,3,4]) = [2,3,4,1]

Sub bytes – each byte of the word is substituted with the value in the S-box whose

index is the value of the original byte

Rcon – the first byte of a word is XORed with the round constant. Each

value of the rcon table is a member of the Rijndael finite field.

3.1.8. KEY SCHEDULE CORE:

This operation is used as an inner loop in the key schedule, and is done as follows:

• The input is a 32-bit word and an iteration number i. The output is a 32-bit word.

• Copy the input over to the output.

• Use the rotate operation described above to rotate the output eight bits to the left.

• Apply Rijndael's S-box to all four individual bytes in the output word.

• On just the first (leftmost) byte of the output word, exclusive-or the byte with 2 to

the power of (i-1), computed in the Rijndael finite field. In other words, perform the

rcon operation with i as the input and exclusive-or the rcon output with the first byte

of the output word.

3.2. DECRYPTION IMPLEMENTATION:

The decryption implementation is similar to the encryption

implementation. The key schedule generation module is modified to supply the round keys

in reverse order: the last round key is used in the first round, and the remaining keys

follow in decreasing order.


FIGURE 3.2: FLOW CHART REPRESENTATION FOR AES DECRYPTION PROCESS

3.2.1. INVERSE SHIFT ROWS:

The inverse operation simply shifts each row to the right by the number of bytes given by

the row number.

3.2.2. INVERSE SUB BYTES:

The inverse operation is performed using the inverted S-box.

3.2.3. INVERSE MIX COLUMNS:

The inverse operation is performed by multiplying each column by the following

inverse matrix (entries in hexadecimal):

0E 0B 0D 09
09 0E 0B 0D
0D 09 0E 0B
0B 0D 09 0E
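
A minimal sketch of the inverse column multiplication, assuming the standard AES inverse matrix with rows (14, 11, 13, 9), (9, 14, 11, 13), (13, 9, 14, 11), (11, 13, 9, 14). The helper names are illustrative.

```python
def xtime(b):
    """Multiply a byte by x (that is, by 2) in GF(2^8)."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def gmul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


INV_MIX = [[14, 11, 13, 9], [9, 14, 11, 13], [13, 9, 14, 11], [11, 13, 9, 14]]


def inv_mix_column(col):
    """Multiply one 4-byte state column by the inverse matrix in GF(2^8)."""
    out = []
    for row in INV_MIX:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gmul(coeff, byte)
        out.append(acc)
    return out
```

The larger coefficients are why the inverse transformation typically costs more logic than the forward one.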

[Figure 3.2 flow chart blocks: Key Schedule; AddRoundKey(state, roundkey); i = Nr; round loop of InvSubBytes(state), InvShiftRows(state), InvMixColumns(state), AddRoundKey(state, roundkey), i = i - 1; decision i > 1; final round of InvSubBytes(state), InvShiftRows(state), AddRoundKey(state, roundkey).]

3.2.4. KEY SCHEDULE DESCRIPTION:

Rijndael's key schedule is done as follows (the constants n and b are defined in Section 3.3):

1. The first n bytes of the expanded key are simply the encryption key.

2. The rcon iteration value i is set to 1.

3. Until we have b bytes of expanded key, we do the following to generate n more bytes of expanded key:

   We do the following to create 4 bytes of expanded key:

      1. We create a 4-byte temporary variable, t.
      2. We assign the value of the previous four bytes in the expanded key to t.
      3. We perform the key schedule core (see above) on t, with i as the rcon iteration value.
      4. We increment i by 1.
      5. We exclusive-or t with the four-byte block n bytes before the new expanded key. This becomes the next 4 bytes in the expanded key.

   We then do the following three times to create the next twelve bytes of expanded key:

      1. We assign the value of the previous 4 bytes in the expanded key to t.
      2. We exclusive-or t with the four-byte block n bytes before the new expanded key. This becomes the next 4 bytes in the expanded key.

   If we are generating a 256-bit key, we do the following to generate the next 4 bytes of expanded key:

      1. We assign the value of the previous four bytes in the expanded key to t.
      2. We run each of the 4 bytes in t through Rijndael's S-box.
      3. We exclusive-or t with the 4-byte block 32 bytes before the new expanded key. This becomes the next 4 bytes in the expanded key.

   If we are generating a 128-bit key, we do not perform the following steps. If we are generating a 192-bit key, we run the following steps twice. If we are generating a 256-bit key, we run the following steps three times:

      1. We assign the value of the previous 4 bytes in the expanded key to t.
      2. We exclusive-or t with the four-byte block n bytes before the new expanded key. This becomes the next 4 bytes in the expanded key.

3.3 CONSTANTS:

Since the key schedules for 128-bit, 192-bit, and 256-bit encryption are very

similar, with only some constants changed, the following key size constants are defined

here:

n has a value of 16 for 128-bit keys, 24 for 192-bit keys, and 32 for 256-bit keys.

b has a value of 176 for 128-bit keys, 208 for 192-bit keys, and 240 for 256-bit keys.
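
The key schedule and its constants can be put together in a compact software model. This is a sketch under stated assumptions, not the VHDL key schedule module of this project: the S-box is generated with the standard Rijndael construction, and the names `rcon`, `schedule_core` and `expand_key` are chosen for illustration. The loop produces four bytes at a time and uses the position within each n-byte group to decide which of the steps above applies.

```python
def xtime(b):
    """Multiply a byte by x (that is, by 2) in GF(2^8)."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def gmul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def build_sbox():
    """Standard Rijndael S-box: field inverse followed by affine transform."""
    sbox = []
    for value in range(256):
        inv = next((c for c in range(1, 256) if gmul(value, c) == 1), 0)
        out = inv
        for shift in range(1, 5):
            out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(out ^ 0x63)
    return sbox


SBOX = build_sbox()


def rcon(i):
    """Round constant: 2 to the power (i - 1) in GF(2^8)."""
    value = 1
    for _ in range(i - 1):
        value = xtime(value)
    return value


def schedule_core(word, i):
    """Key schedule core: rotate, substitute, then XOR rcon into byte 0."""
    word = word[1:] + word[:1]          # rotate eight bits to the left
    word = [SBOX[x] for x in word]      # S-box on all four bytes
    word[0] ^= rcon(i)
    return word


def expand_key(key):
    """Expand a 16-, 24- or 32-byte key to b bytes of round key material."""
    n = len(key)
    b = {16: 176, 24: 208, 32: 240}[n]  # constants from Section 3.3
    expanded = list(key)
    i = 1
    while len(expanded) < b:
        t = expanded[-4:]
        position = len(expanded) % n
        if position == 0:               # start of a new n-byte group
            t = schedule_core(t, i)
            i += 1
        elif n == 32 and position == 16:  # extra S-box step for 256-bit keys
            t = [SBOX[x] for x in t]
        base = len(expanded) - n        # four-byte block n bytes earlier
        expanded += [t[k] ^ expanded[base + k] for k in range(4)]
    return expanded[:b]
```

For decryption the same expanded key is used, with the 16-byte round keys read from the end towards the start.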


3.4 HARDWARE IMPLEMENTATION:

In this project, the hardware implementation is done on the Spartan-3E FPGA Starter

Kit.

3.4.1 SPARTAN-3E FPGA FEATURES AND EMBEDDED PROCESSING

FUNCTIONS:

The Spartan-3E Starter Kit board highlights the unique features of the Spartan-

3E FPGA family and provides a convenient development board for embedded

processing applications. The board highlights these features:

• Spartan-3E specific features

• Parallel NOR Flash configuration

• MultiBoot FPGA configuration from Parallel NOR Flash PROM

• SPI serial Flash configuration

• Embedded development

• MicroBlaze™ 32-bit embedded RISC processor

• PicoBlaze™ 8-bit embedded controller

• DDR memory interfaces

Key Components and Features:

The key features of the Spartan-3E Starter Kit board are:

• Xilinx XC3S500E Spartan-3E FPGA

• Up to 232 user-I/O pins

• 320-pin FBGA package

• Over 10,000 logic cells

• Xilinx 4 Mbit Platform Flash configuration PROM

• Xilinx 64-macrocell XC2C64A CoolRunner CPLD

• 64 MByte (512 Mbit) of DDR SDRAM, x16 data interface, 100+ MHz

• 16 MByte (128 Mbit) of parallel NOR Flash (Intel StrataFlash)

• FPGA configuration storage

• MicroBlaze code storage/shadowing

• 16 Mbits of SPI serial Flash (STMicro)

• FPGA configuration storage

• MicroBlaze code shadowing

• 2-line, 16-character LCD screen

• PS/2 mouse or keyboard port

• VGA display port

• 10/100 Ethernet PHY (requires Ethernet MAC in FPGA)

• Two 9-pin RS-232 ports (DTE- and DCE-style)

• On-board USB-based FPGA/CPLD download/debug interface

• 50 MHz clock oscillator

• SHA-1 1-wire serial EEPROM for bitstream copy protection

• Hirose FX2 expansion connector

• Three Digilent 6-pin expansion connectors


• Four-output, SPI-based Digital-to-Analog Converter (DAC)

• Two-input, SPI-based Analog-to-Digital Converter (ADC) with programmable-

gain pre-amplifier

• ChipScope™ SoftTouch debugging port

• Rotary-encoder with push-button shaft

• Eight discrete LEDs

3.4.2 CHARACTER LCD SCREEN :

The Spartan-3E Starter Kit board prominently features a 2-line by 16-character liquid

crystal display (LCD). The FPGA controls the LCD via the 4-bit data interface shown in

Figure 3.4. Although the LCD supports an 8-bit data interface, the Starter Kit board uses a

4-bit data interface to remain compatible with other Xilinx development boards and to

minimize total pin count.

FIGURE 3.4: CHARACTER LCD INTERFACE

Once mastered, the LCD is a practical way to display a variety of information using

standard ASCII and custom characters. However, these displays are not fast. Scrolling the

display at half-second intervals tests the practical limit for clarity. Compared with the 50

MHz clock available on the board, the display is slow. A PicoBlaze processor efficiently

controls display timing plus the actual content of the display.

3.4.3 CHARACTER LCD INTERFACE SIGNALS :

Table 3.4 shows the character LCD interface signals.

TABLE 3.4: CHARACTER LCD INTERFACE

3.4.4 VOLTAGE COMPATIBILITY :

The character LCD is powered by +5V. The FPGA I/O signals are powered by

3.3V. However, the FPGA's output levels are recognized as valid low or high logic levels by

the LCD. The LCD controller accepts 5V TTL signal levels, and the 3.3V LVCMOS outputs

provided by the FPGA meet the 5V TTL voltage level requirements.

The 390 Ω series resistors on the data lines prevent overstressing the FPGA and

StrataFlash I/O pins when the character LCD drives a high logic value. The character

LCD drives the data lines when LCD_RW is high. Most applications treat the LCD as a

write-only peripheral and never read from the display.

INTERACTION WITH INTEL STRATAFLASH :

As shown in Figure 3.4, the four LCD data signals are also shared with

StrataFlash data lines SF_D<11:8>. As shown in Table 3-2, the LCD/StrataFlash

interaction depends on the application usage in the design. When the StrataFlash

memory is disabled (SF_CE0 = high), the FPGA application has full read/write access

to the LCD. Conversely, when LCD read operations are disabled (LCD_RW = low),

the FPGA application has full read/write access to the StrataFlash memory.

TABLE 3-2: LCD/STRATAFLASH CONTROL INTERACTION

Note: 'X' indicates a don't care; it can be either 0 or 1.

If the StrataFlash memory is in byte-wide (x8) mode (SF_BYTE = low), the

FPGA application has full simultaneous read/write access to both the LCD and the

StrataFlash memory. In byte-wide mode, the StrataFlash memory does not use

the SF_D<15:8> data lines.

UCF Location Constraints :

The listing below provides the UCF constraints for the character LCD, including the

I/O pin assignment and the I/O standard used.

NET "LCD_E"    LOC = "M18" | IOSTANDARD = LVCMOS33 | DRIVE = 4 | SLEW = SLOW ;
NET "LCD_RS"   LOC = "L18" | IOSTANDARD = LVCMOS33 | DRIVE = 4 | SLEW = SLOW ;
NET "LCD_RW"   LOC = "L17" | IOSTANDARD = LVCMOS33 | DRIVE = 4 | SLEW = SLOW ;
# The LCD 4-bit data interface is shared with the StrataFlash
NET "SF_D<8>"  LOC = "R15" | IOSTANDARD = LVCMOS33 | DRIVE = 4 | SLEW = SLOW ;
NET "SF_D<9>"  LOC = "R16" | IOSTANDARD = LVCMOS33 | DRIVE = 4 | SLEW = SLOW ;
NET "SF_D<10>" LOC = "P17" | IOSTANDARD = LVCMOS33 | DRIVE = 4 | SLEW = SLOW ;
NET "SF_D<11>" LOC = "M15" | IOSTANDARD = LVCMOS33 | DRIVE = 4 | SLEW = SLOW ;

3.5 LCD CONTROLLER :

The 2 x 16 character LCD has an internal Sitronix ST7066U graphics controller that is

functionally equivalent to the following devices:

Samsung S6A0069X or KS0066U

Hitachi HD44780

SMOS SED1278

LCD/StrataFlash control interaction (Table 3-2):

SF_CE0   SF_BYTE   LCD_RW   OPERATION
1        X         X        StrataFlash disabled. Full read/write access to LCD.
X        X         0        LCD write access only. Full access to StrataFlash.
X        0         X        StrataFlash in byte-wide (x8) mode. Upper data lines are not used. Full access to both LCD and StrataFlash.


3.5.1 MEMORY MAP :

The controller has three internal memory regions, each with a specific purpose. The

display must be initialized before accessing any of these memory regions.

3.5.2 DD RAM :

The display data RAM (DD RAM) stores the character code to be displayed on the

screen. Most applications interact primarily with DD RAM. The character code stored in a

DD RAM location references a specific character bitmap stored either in the predefined

CG ROM character set or in the user-defined CG RAM character set.

Figure 3.5 shows the default address for the 32 character locations on the display.

The upper line of characters is stored between addresses 0x00 and 0x0f. The second line

of characters is stored between addresses 0x40 and 0x4f.

FIGURE 3.5: DD RAM HEXADECIMAL ADDRESSES (NO DISPLAY SHIFTING)

Physically, there are 80 total character locations in DD RAM, with 40 characters

available per line. Locations 0x10 through 0x27 and 0x50 through 0x67 can be used to

store other non-display data. Alternatively, these locations can also store characters that

can only be displayed using the controller's display shifting functions.
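
The address layout can be captured in a small helper. This is a sketch: the function names are hypothetical, and the encoding of the Set DD RAM Address command (DB7 = 1 followed by the 7-bit address) is assumed from the HD44780-compatible command set rather than taken from a table reproduced in this document.

```python
def ddram_address(line, column):
    """DD RAM address of a character position (line 0 or 1, column 0..39).

    Columns 0..15 are visible without display shifting.
    """
    if line not in (0, 1) or not 0 <= column < 40:
        raise ValueError("line must be 0 or 1 and column must be 0..39")
    return 0x40 * line + column


def set_ddram_address_command(address):
    """Command byte for Set DD RAM Address (assumed: DB7 = 1, DB6..DB0 = address)."""
    return 0x80 | (address & 0x7F)
```

For example, the first character of the second line sits at address 0x40, so the assumed command byte to move there is 0xC0.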

The Set DD RAM Address command initializes the address counter before reading

or writing to DD RAM. Write DD RAM data using the Write Data to CG RAM or DD RAM

command, and read DD RAM using the Read Data from CG RAM or DD RAM command.

The DD RAM address counter either remains constant after read or write

operations, or auto-increments or auto-decrements by one location, as defined by the I/D

bit set by the Entry Mode Set command.

3.5.3 CG ROM:

The character generator ROM (CG ROM) contains the font bitmap for each of the

predefined characters that the LCD screen can display, shown in Figure 3.5.1. The character

code stored in DD RAM for each character location subsequently references a position

within the CG ROM. For example, a hexadecimal character code of 0x53 stored in a DD RAM

location displays the character 'S'. The upper nibble of 0x53 equates to

DB[7:4] = "0101" binary and the lower nibble equates to DB[3:0] = "0011" binary. As shown

in Figure 3.5.1, the character 'S' appears on the screen. English/Roman characters are stored

in CG ROM at their equivalent ASCII code address.

The character ROM contains the ASCII English character set and Japanese kana

characters. The controller also provides for eight custom character bitmaps, stored in CG

RAM. These eight custom characters are displayed by storing character codes 0x00

through 0x07 in a DD RAM location.

FIGURE 3.5.1: LCD CHARACTER SET

3.5.4 CG RAM :

The character generator RAM (CG RAM) provides space to create eight custom

character bitmaps. Each custom character location consists of a 5-dot by 8-line bitmap, as

shown in Figure 3.4.

The Set CG RAM Address command initializes the address counter before reading or

writing to CG RAM. Write CG RAM data using the Write Data to CG RAM or DD RAM

command, and read CG RAM using the Read Data from CG RAM or DD RAM command.

The CG RAM address counter either remains constant after read or write

operations, or auto-increments or auto-decrements by one location, as defined by the I/D

bit set by the Entry Mode Set command.

Figure 3.4 provides an example, creating a special checkerboard character. The

custom character is stored in the fourth CG RAM character location, which is displayed

when a DD RAM location is 0x03. To write the custom character, the CG RAM address is

first initialized using the Set CG RAM Address command. The upper three address bits point

to the custom character location.

The lower three address bits point to the row address for the character bitmap. The

Write Data to CG RAM or DD RAM command is used to write each character bitmap row. A

'1' lights a bit on the display. A '0' leaves the bit unlit. Only the lower five data bits are

used; the upper three data bits are don't care positions. The eighth row of bitmap data is

usually left as all zeros to accommodate the cursor.

Figure 3.4: Example Custom Checkerboard Character With Character Code 0x03
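
The write sequence for a custom character can be sketched as below. The helper name `cgram_writes` is hypothetical, the Set CG RAM Address opcode (0x40 plus the 6-bit address) is assumed from the HD44780-compatible command set, and the exact dot pattern of the checkerboard is an illustrative guess rather than a copy of the figure.

```python
def cgram_writes(char_index, rows):
    """Return (set_address_command, data_byte) pairs for one custom character.

    char_index: custom character location 0..7 (upper three address bits)
    rows: eight 5-bit row bitmaps, top row first (lower three address bits)
    """
    if not 0 <= char_index < 8 or len(rows) != 8:
        raise ValueError("need a character index 0..7 and exactly 8 rows")
    writes = []
    for row, bits in enumerate(rows):
        address = (char_index << 3) | row
        # only the lower five data bits are used by the display
        writes.append((0x40 | address, bits & 0x1F))
    return writes


# Assumed checkerboard pattern; the eighth row is left zero for the cursor
CHECKERBOARD = [0b01010, 0b10101] * 3 + [0b01010, 0b00000]
writes = cgram_writes(3, CHECKERBOARD)
```

If the address counter is in auto-increment mode, only the first address command is strictly needed and the eight data bytes can follow one after another.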

3.6 COMMAND SET :

Table 3.5 summarizes the available LCD controller commands and bit

definitions. Because the display is set up for 4-bit operation, each 8-bit command is

sent as two 4-bit nibbles. The upper nibble is transferred first, followed by the lower

nibble.

TABLE 3.5: LCD CHARACTER DISPLAY COMMAND SET


3.6.1 DISABLED :

If the LCD_E enable signal is low, all other inputs to the LCD are ignored.

3.6.2 CLEAR DISPLAY :

Clear the display and return the cursor to the home position, the top-left corner.

This command writes a blank space (ASCII/ANSI character code 0x20) into all DD RAM

addresses. The address counter is reset to 0, location 0x00 in DD RAM. Clears all option

settings. The I/D control bit is set to 1 (increment address counter mode) in the Entry Mode

Set command. Execution time: 82 µs to 1.64 ms.

3.6.3 RETURN CURSOR HOME :

Return the cursor to the home position, the top-left corner. DD RAM contents are

unaffected. Also returns a shifted display to the original position, shown in Figure

3.5. The address counter is reset to 0, location 0x00 in DD RAM. The display is returned to

its original status if it was shifted. The cursor or blink moves to the top-left character

location. Execution time: 40 µs to 1.6 ms.

3.7 ENTRY MODE SET :

Sets the cursor move direction and specifies whether or not to shift the display.

These operations are performed during data reads and writes. Execution time: 40 µs.

3.7.1 Bit DB1: (I/D) Increment/Decrement

0   Auto-decrement address counter. Cursor/blink moves to the left.

1   Auto-increment address counter. Cursor/blink moves to the right.

This bit either auto-increments or auto-decrements the DD RAM and CG RAM

address counter by one location after each Write Data to CG RAM or DD RAM or Read Data

from CG RAM or DD RAM command. The cursor or blink position moves accordingly.

3.7.2 Bit DB0: (S) Shift

0   Shifting disabled.

1   During a DD RAM write operation, shift the entire display in the direction controlled by bit DB1 (I/D). It appears as though the cursor position remains constant and the display moves.

3.8 DISPLAY ON/OFF

Display is turned on or off, controlling all characters, cursor and cursor position

character (underscore) blink. Execution time: 40µs.

3.8.1 Bit DB2: (D) Display On/Off

0   No characters displayed. However, data stored in DD RAM is retained.

1   Display characters stored in DD RAM.

3.8.2 Bit DB1: (C) Cursor On/Off :

The cursor uses the five dots on the bottom line of the character. The cursor

appears as a line under the displayed character.


0 No Cursor

1 Display Cursor

3.8.3 Bit DB0: (B) Cursor Blink On/Off :

0   No cursor blinking.

1   Cursor blinks on and off approximately every half second.
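
The control bits of the Entry Mode Set and Display On/Off commands, and the nibble-wise transfer used by the 4-bit interface, can be sketched as follows. The bit positions (I/D, S, D, C, B) come from the descriptions above; the opcode bases 0x04 and 0x08 are assumed from the HD44780-compatible command set, and the function names are illustrative.

```python
def entry_mode_set(increment=True, shift=False):
    """Entry Mode Set command byte: DB1 = I/D, DB0 = S (assumed base 0x04)."""
    return 0x04 | (int(increment) << 1) | int(shift)


def display_on_off(display=True, cursor=False, blink=False):
    """Display On/Off command byte: DB2 = D, DB1 = C, DB0 = B (assumed base 0x08)."""
    return 0x08 | (int(display) << 2) | (int(cursor) << 1) | int(blink)


def to_nibbles(command):
    """Split an 8-bit command into (upper, lower) nibbles, upper sent first."""
    return (command >> 4) & 0xF, command & 0xF
```

A typical initialization would send `entry_mode_set()` and `display_on_off()` as two nibbles each, waiting at least the stated execution time between commands.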

3.9 CURSOR AND DISPLAY SHIFT

Moves the cursor and shifts the display without changing DD RAM contents. Shift

cursor position or display to the right or left without writing or reading display data.

This function positions the cursor in order to modify an individual character, or to

scroll the display window left or right to reveal additional data stored in the DD RAM,

beyond the 16th character on a line. The cursor automatically moves to the second line

when it shifts beyond the 40th character location of the first line. The first and second line

displays shift at the same time. When the displayed data is shifted repeatedly, both lines

move horizontally. The second display line does not shift into the first display line.

Execution time: 40 µs.

DB3 (S/C)   DB2 (R/L)   Operation

0           0           Shift the cursor position to the left. The address counter is decremented by one.

0           1           Shift the cursor position to the right. The address counter is incremented by one.

1           0           Shift the entire display to the left. The cursor follows the display shift. The address counter is unchanged.

1           1           Shift the entire display to the right. The cursor follows the display shift. The address counter is unchanged.
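
The table above can be expressed as a small command builder. This is only a sketch: the name `cursor_display_shift` is hypothetical, and the opcode bit 0x10 (DB4) is an assumption based on the usual command layout of HD44780-compatible controllers, not something stated in this section.

```python
def cursor_display_shift(shift_display: bool, to_right: bool) -> int:
    """Build a Cursor and Display Shift command byte (hypothetical helper)."""
    command = 0x10                      # assumed opcode bit (DB4) for this command
    command |= int(shift_display) << 3  # DB3 (S/C): 1 = shift display, 0 = move cursor
    command |= int(to_right) << 2       # DB2 (R/L): 1 = right, 0 = left
    return command


# One command byte per row of the table, in the same order.
for sc in (False, True):
    for rl in (False, True):
        print(int(sc), int(rl), hex(cursor_display_shift(sc, rl)))
```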

3.9.1 FUNCTION SET :

Sets interface data length, number of display lines, and character font. The starter

kit board supports a single function set with value 0x28. Execution time: 40µs


3.9.2 SET CG RAM ADDRESS :

Set the initial CG RAM address. After this command, all subsequent read or write operations to the display are to or from CG RAM. Execution time: 40 µs

3.9.3 SET DD RAM ADDRESS :

Set the initial DD RAM address. After this command, all subsequent read or write operations to the display are to or from DD RAM. The addresses for displayed characters appear in Figure 3.3. Execution time: 40 µs

3.9.4 READ BUSY FLAG AND ADDRESS

Read the busy flag (BF) to determine if an internal operation is in progress, and read the current address counter contents.

BF = 1 indicates that an internal operation is in progress. The next instruction is not accepted until BF is cleared or until the current instruction is allowed the maximum time to execute.

This command also returns the present value of the address counter. The address counter is used for both CG RAM and DD RAM addresses. The specific context depends on the most recent Set CG RAM Address or Set DD RAM Address command issued.

Execution time: 1 µs
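
A minimal sketch of how the returned byte might be decoded follows. It assumes the layout usual for HD44780-compatible controllers, with BF on DB7 and the 7-bit address counter on DB6..DB0; this layout is not spelled out in this section, and the name `decode_busy_flag_read` is hypothetical.

```python
from typing import Tuple


def decode_busy_flag_read(value: int) -> Tuple[bool, int]:
    """Split a Read Busy Flag and Address result into (busy, address)."""
    busy = bool(value & 0x80)  # assumed: DB7 carries the busy flag
    address = value & 0x7F     # assumed: DB6..DB0 carry the address counter
    return busy, address


busy, address = decode_busy_flag_read(0x85)
print(busy, hex(address))  # True 0x5
```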

3.9.5 WRITE DATA TO CG RAM OR DD RAM :

Write data into DD RAM if the command follows a previous Set DD RAM Address command, or write data into CG RAM if the command follows a previous Set CG RAM Address command.

After the write operation, the address is automatically incremented or

decremented by 1 according to the entry mode set command. The entry mode also

determines display shift.

Execution time: 40µs.

3.9.6 READ DATA FROM CG RAM OR DD RAM:

Read data from DD RAM if the command follows a previous Set DD RAM Address command, or read data from CG RAM if the command follows a previous Set CG RAM Address command. After the read operation, the address is automatically incremented or decremented by 1 according to the Entry Mode Set command. However, a display shift is not executed during read operations. Execution time: 40 µs.

3.10 OPERATION:

3.10.1 FOUR-BIT DATA INTERFACE :

The board uses a 4-bit data interface to the character LCD.

Figure 3.6 illustrates a write operation to the LCD, showing the minimum times allowed for setup, hold, and enable pulse length relative to the 50 MHz clock (20 ns period) provided on the board.

FIGURE 3.6 : CHARACTER LCD INTERFACE TIMING


The data values on SF_D<11:8>, the register select (LCD_RS) and the read/write (LCD_RW) control signals must be set up and stable at least 40 ns before the enable LCD_E goes high. The enable signal must remain high for 230 ns or longer, the equivalent of 12 or more clock cycles at 50 MHz.
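
The cycle counts quoted above follow from rounding each minimum time up to a whole number of 20 ns clock periods. The sketch below shows that arithmetic; the helper name `min_clock_cycles` is hypothetical.

```python
CLOCK_PERIOD_NS = 20  # 50 MHz board clock


def min_clock_cycles(duration_ns: int) -> int:
    """Smallest whole number of clock cycles that covers duration_ns."""
    return -(-duration_ns // CLOCK_PERIOD_NS)  # integer ceiling division


print(min_clock_cycles(40))   # setup time before LCD_E goes high: 2 cycles
print(min_clock_cycles(230))  # minimum LCD_E high time: 12 cycles
```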

In many applications, the LCD_RW signal can be tied low permanently because the FPGA generally has no reason to read information from the display.

3.10.2 TRANSFERRING 8-BIT DATA OVER THE 4-BIT INTERFACE :

After initializing the display and establishing communication, all commands and data transfers to the character display are via 8 bits, transferred using two sequential 4-bit operations. Each 8-bit transfer must be decomposed into two 4-bit transfers, spaced apart by at least 1 µs, as shown in Figure 3.6. The upper nibble is transferred first, followed by the lower nibble. An 8-bit write operation must be spaced at least 40 µs before the next communication. This delay must be increased to 1.64 ms following a Clear Display command.
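
The decomposition of one byte into two 4-bit transfers can be sketched as follows. The helper name `split_nibbles` is hypothetical; the ordering (upper nibble first) is the one described above.

```python
from typing import Tuple


def split_nibbles(byte: int) -> Tuple[int, int]:
    """Return (upper, lower) 4-bit values in the order they are sent."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected an 8-bit value")
    upper = (byte >> 4) & 0xF  # driven onto SF_D<11:8> first
    lower = byte & 0xF         # driven onto SF_D<11:8> at least 1 µs later
    return upper, lower


print(split_nibbles(0x28))  # (2, 8)
```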

3.10.3 INITIALIZING THE DISPLAY

After power-on, the display must be initialized to establish the required communication protocol. The initialization sequence is simple and ideally suited to the highly efficient 8-bit PicoBlaze embedded controller. After initialization, the PicoBlaze controller is available for more complex control or computation beyond simply driving the display.

3.10.4 POWER-ON INITIALIZATION:

The initialization sequence first establishes that the FPGA application wishes to use the four-bit data interface to the LCD, as follows:

Wait 15 ms or longer, although the display is generally ready when the FPGA finishes configuration. The 15 ms interval is 750,000 clock cycles at 50 MHz.

Write SF_D<11:8> = 0x3, pulse LCD_E high for 12 clock cycles.

Wait 4.1 ms or longer, which is 205,000 clock cycles at 50 MHz.

Wait 100 µs or longer, which is 5,000 clock cycles at 50 MHz.






3.10.5 DISPLAY CONFIGURATION :

After the power-on initialization is completed, the four-bit interface is established. The next part of the sequence configures the display:

Issue a Function Set command, 0x28, to configure the display for operation on the Spartan-3E Starter Kit board.

Issue an Entry Mode Set command, 0x06, to set the display to automatically increment the address pointer.

Issue a Display On/Off command, 0x0C, to turn the display on and disable the cursor and blinking.

Finally, issue a Clear Display command. Allow at least 1.64 ms (82,000 clock cycles) after issuing this command.
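
The configuration steps can be written down as a table of command bytes and post-command waits, converted to 50 MHz clock cycles. This is a sketch only: the names `CONFIG_SEQUENCE` and `with_wait_cycles` are hypothetical, the Clear Display opcode 0x01 is an assumption (its value is not given in this section), and the 40 µs gap after the other commands is taken from the 8-bit transfer spacing described in section 3.10.2.

```python
CYCLES_PER_US = 50  # 50 MHz clock: 50 cycles per microsecond

# (command name, command byte, minimum wait afterwards in microseconds)
CONFIG_SEQUENCE = [
    ("Function Set", 0x28, 40),
    ("Entry Mode Set", 0x06, 40),
    ("Display On/Off", 0x0C, 40),
    ("Clear Display", 0x01, 1640),  # opcode 0x01 is an assumption
]


def with_wait_cycles(sequence):
    """Convert each wait from microseconds to clock cycles."""
    return [(name, byte, wait_us * CYCLES_PER_US) for name, byte, wait_us in sequence]


for name, byte, cycles in with_wait_cycles(CONFIG_SEQUENCE):
    print(f"{name}: 0x{byte:02X}, then wait {cycles} cycles")
```

The last entry works out to 82,000 cycles, matching the figure quoted for the Clear Display command.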

3.10.6 WRITING DATA TO THE DISPLAY :

To write data to the display, specify the start address, followed by one or more data values. Before writing any data, issue a Set DD RAM Address command to specify the initial 7-bit address in the DD RAM. See Figure 3.3 for DD RAM locations.

Write data to the display using a Write Data to CG RAM or DD RAM command. The 8-bit data value represents the look-up address into the CG ROM or CG RAM, shown in Figure 3.4. The stored bitmap in the CG ROM or CG RAM drives the 5 x 8 dot matrix to represent the associated character.

If the address counter is configured to auto-increment, as described earlier, the

application can sequentially write multiple character codes and each character is

automatically stored and displayed in the next available location.

Continuing to write characters, however, eventually falls off the end of the first

display line. The additional characters do not automatically appear on the second line

because the DD RAM map is not consecutive from the first line to the second.
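
Because the two lines are not contiguous in DD RAM, an application has to issue a new Set DD RAM Address command to continue on the second line. The sketch below illustrates the idea under stated assumptions: line start addresses of 0x00 and 0x40 and a command opcode bit of 0x80 (DB7) are the usual values for two-line HD44780-compatible displays and are not confirmed by this section; the name `set_ddram_address` is hypothetical.

```python
LINE_START = {1: 0x00, 2: 0x40}  # assumed DD RAM start address of each line
CHARS_PER_LINE = 40              # DD RAM locations per line


def set_ddram_address(line: int, column: int) -> int:
    """Build a Set DD RAM Address command for a line (1 or 2) and column (0-based)."""
    if line not in LINE_START or not 0 <= column < CHARS_PER_LINE:
        raise ValueError("position outside DD RAM")
    return 0x80 | (LINE_START[line] + column)  # assumed opcode bit DB7


print(hex(set_ddram_address(1, 0)))  # 0x80
print(hex(set_ddram_address(2, 0)))  # 0xc0
```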

3.10.7 DISABLING THE UNUSED LCD :

If the FPGA application does not use the character LCD screen, drive the LCD_E

pin low to disable it. Also drive the LCD_RW pin low to prevent the LCD screen from

presenting data.


CHAPTER 4

RESULTS AND DISCUSSIONS

4.1 RTL SCHEMATIC FOR ENCRYPTION:

4.2 SIMULATION WAVE FORM FOR AES ENCRYPTION:


4.4 SYNTHESIS REPORT FOR AES ENCRYPTION:

====================================================================

* Final Report *

====================================================================

Final Results

RTL Top Level Output File Name : encryption.ngr

Top Level Output File Name : encryption

Output Format : NGC

Optimization Goal : Speed

Keep Hierarchy : NO

Design Statistics

# IOs : 258

Cell Usage :

# BELS : 47282

# GND : 1

# INV : 182

# LUT1 : 164

# LUT2 : 1776

# LUT2_L : 111

# LUT3 : 231

# LUT3_D : 8

# LUT3_L : 609

# LUT4 : 22881

# LUT4_D : 14

# LUT4_L : 273

# MUXF5 : 11338

# MUXF6 : 5698

# MUXF7 : 2839

# MUXF8 : 1156

# VCC : 1

# FlipFlops/Latches : 2784

# FD : 2626

# FDE : 129

# FDR : 26

# FDS : 3

# Shift Registers : 11

# SRL16 : 11

# Clock Buffers : 1

# BUFGP : 1

# IO Buffers : 256

# OBUF : 256

====================================================================


Device utilization summary:

---------------------------

Selected Device : 3s500efg320-5

Number of Slices: 14517 out of 4656 311% (*)

Number of Slice Flip Flops: 2784 out of 9312 29%

Number of 4 input LUTs: 26260 out of 9312 282% (*)

Number used as logic: 26249

Number used as Shift registers: 11

Number of IOs: 258

Number of bonded IOBs: 257 out of 232 110% (*)

Number of GCLKs: 1 out of 24 4%

WARNING:Xst:1336 - (*) More than 100% of Device resources are used

---------------------------

Partition Resource Summary:

---------------------------

No Partitions were found in this design.

---------------------------

====================================================================

TIMING REPORT

NOTE: THESE TIMING NUMBERS ARE ONLY A SYNTHESIS ESTIMATE.

FOR ACCURATE TIMING INFORMATION PLEASE REFER TO THE TRACE REPORT

GENERATED AFTER PLACE-and-ROUTE.

Clock Information:

------------------

-----------------------------------+------------------------+-------+

Clock Signal | Clock buffer(FF name) | Load |

-----------------------------------+------------------------+-------+

clk | BUFGP | 2795 |

-----------------------------------+------------------------+-------+

Asynchronous Control Signals Information:

----------------------------------------

No asynchronous control signals found in this design


Timing Summary:

---------------

Speed Grade: -5

Minimum period: 6.654ns (Maximum Frequency: 150.280MHz)

Minimum input arrival time before clock: No path found

Maximum output required time after clock: 4.040ns

Maximum combinational path delay: No path found

Timing Detail:

--------------

All values displayed in nanoseconds (ns)

====================================================================

Timing constraint: Default period analysis for Clock 'clk'

Clock period: 6.654ns (frequency: 150.280MHz)

Total number of paths / destination ports: 591072 / 2926

-------------------------------------------------------------------------

Delay: 6.654ns (Levels of Logic = 7)

Source: a35/x05_15 (FF)

Destination: a39/x05_109 (FF)

Source Clock: clk rising

Destination Clock: clk rising

Data Path: a35/x05_15 to a39/x05_109

Gate Net

Cell:in->out fanout Delay Delay Logical Name (Net Name)

---------------------------------------- ------------

FD:C->Q 11 0.514 0.796 a35/x05_15 (a35/x05_15)

LUT4:I3->O 3 0.612 0.481 a36/h1<12>_SW0 (N1659)

LUT4:I2->O 125 0.612 1.128 Mxor_data9_Result<12>1 (data9<12>)

LUT4:I2->O 1 0.612 0.000 a38/s2/Mrom_sbout1 (a38/s2/Mrom_sbout)

MUXF5:I1->O 1 0.278 0.000 a38/s2/Mrom_sbout_f5 (a38/s2/Mrom_sbout_f5)



MUXF8:I1->O 1 0.451 0.000 a38/s2/Mrom_sbout_f8 (sb9<8>)

FD:D 0.268 a39/x05_104

----------------------------------------

Total 6.654ns (4.249ns logic, 2.405ns route)

(63.9% logic, 36.1% route)

====================================================================

Timing constraint: Default OFFSET OUT AFTER for Clock 'clk'


-------------------------------------------------------------------------


Offset: 4.040ns (Levels of Logic = 1)

Source: eout_127 (FF)

Destination: eout<127> (PAD)


Data Path: eout_127 to eout<127>

Gate Net


---------------------------------------- ------------

FDE:C->Q 1 0.514 0.357 eout_127 (eout_127)

OBUF:I->O 3.169 eout_127_OBUF (eout<127>)

----------------------------------------


(91.2% logic, 8.8% route)

====================================================================

Total REAL time to Xst completion: 359.00 secs

Total CPU time to Xst completion: 359.06 secs

-->

Total memory usage is 302972 kilobytes

Number of errors : 0 ( 0 filtered)

Number of warnings : 376 ( 0 filtered)

Number of infos : 34 ( 0 filtered)
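
The device utilization summary in the encryption report can be checked with a few lines of arithmetic. In the sketch below the (used, available) pairs are copied from the report; the names `UTILIZATION`, `percent_used` and `over_utilized` are hypothetical. Truncating integer division reproduces the percentages printed by XST, and the resources flagged with (*) are those where the design needs more than the selected device provides.

```python
# (used, available) pairs copied from the encryption device utilization summary
UTILIZATION = {
    "slices": (14517, 4656),
    "slice flip flops": (2784, 9312),
    "4-input LUTs": (26260, 9312),
    "bonded IOBs": (257, 232),
    "GCLKs": (1, 24),
}


def percent_used(used: int, available: int) -> int:
    """Utilization as a whole percentage, truncated as in the report."""
    return used * 100 // available


def over_utilized(report):
    """Names of resources for which the design exceeds the device capacity."""
    return [name for name, (used, available) in report.items() if used > available]


for name, (used, available) in UTILIZATION.items():
    print(f"{name}: {percent_used(used, available)}%")
print("exceeds device:", over_utilized(UTILIZATION))
```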


4.5 RTL SCHEMATIC FOR AES DECRYPTION

4.7 SIMULATION WAVE FORM FOR AES DECRYPTION:


4.8 SYNTHESIS REPORT FOR AES DECRYPTION:

====================================================================

* Final Report *

====================================================================

Final Results

RTL Top Level Output File Name : decryption.ngr

Top Level Output File Name : decryption

Output Format : NGC

Optimization Goal : Speed

Keep Hierarchy : NO

Design Statistics

# IOs : 258

Cell Usage :

# BELS : 56417

# GND : 1

# INV : 27

# LUT1 : 143

# LUT2 : 1461

# LUT2_D : 788

# LUT2_L : 101

# LUT3 : 506

# LUT3_D : 5

# LUT3_L : 13

# LUT4 : 28454

# LUT4_D : 604

# LUT4_L : 967

# MUXF5 : 12541

# MUXF6 : 6272

# MUXF7 : 3136

# MUXF8 : 1397

# VCC : 1

# FlipFlops/Latches : 3118

# FD : 2815

# FDE : 129

# FDR : 99

# FDS : 3

# LDCP : 72

# Shift Registers : 11

# SRL16 : 11

# Clock Buffers : 1

# BUFGP : 1

# IO Buffers : 256


# OBUF : 256

====================================================================

Device utilization summary:

---------------------------

Selected Device : 3s100evq100-5

Number of Slices: 16884 out of 960 1758% (*)

Number of Slice Flip Flops: 3118 out of 1920 162% (*)

Number of 4 input LUTs: 33080 out of 1920 1722% (*)

Number used as logic: 33069

Number used as Shift registers: 11

Number of IOs: 258

Number of bonded IOBs: 257 out of 66 389% (*)

Number of GCLKs: 1 out of 24 4%

WARNING:Xst:1336 - (*) More than 100% of Device resources are used

---------------------------

Partition Resource Summary:

---------------------------

No Partitions were found in this design.

---------------------------

====================================================================

TIMING REPORT

NOTE: THESE TIMING NUMBERS ARE ONLY A SYNTHESIS ESTIMATE.

FOR ACCURATE TIMING INFORMATION PLEASE REFER TO THE TRACE REPORT

GENERATED AFTER PLACE-and-ROUTE.

Clock Information:

------------------

---------------------------------------------+------------------------+-------+

Clock Signal | Clock buffer(FF name) | Load |

---------------------------------------------+------------------------+-------+

clk | BUFGP | 3057 |

d37/x40_8_cmp_lt0000(d37/x40_8_cmp_lt00001:O)| NONE(*)(d37/x40_8) | 8 |










---------------------------------------------+------------------------+-------+

(*) These 9 clock signal(s) are generated by combinatorial logic,

and XST is not able to identify which are the primary clock signals.

Please use the CLOCK_SIGNAL constraint to specify the clock signal(s) generated by

combinatorial logic.

INFO:Xst:2169 - HDL ADVISOR - Some clock signals were not automatically buffered by XST

with BUFG/BUFR resources. Please use the buffer_type constraint in order to insert these

buffers to the clock signals to help prevent skew problems.

Timing Summary:

---------------

Speed Grade: -5

Minimum period: 9.510ns (Maximum Frequency: 105.149MHz)

Minimum input arrival time before clock: No path found

Maximum output required time after clock: 4.040ns

Maximum combinational path delay: No path found

Timing Detail:

--------------

All values displayed in nanoseconds (ns)

====================================================================

Timing constraint: Default period analysis for Clock 'clk'

Clock period: 9.510ns (frequency: 105.149MHz)


-------------------------------------------------------------------------

Delay: 9.510ns (Levels of Logic = 9)

Source: d35/x09_66 (FF)

Destination: d38/x09_19 (FF)


Destination Clock: clk rising

Data Path: d35/x09_66 to d38/x09_19

Gate Net


---------------------------------------- ------------

FD:C->Q 127 0.514 1.250 d35/x09_66 (d35/x09_66)


LUT1:I0->O 1 0.612 0.000 d36/is9/Mrom_sibout14_f5_4_rt

(d36/is9/Mrom_sibout14_f5_4_rt)

MUXF5:I1->O 1 0.278 0.000 d36/is9/Mrom_sibout14_f5_4

(d36/is9/Mrom_sibout14_f55)





MUXF8:I0->O 23 0.451 1.091 d36/is9/Mrom_sibout14_f8 (isb9<71>)

LUT2:I1->O 16 0.612 0.882 Mxor_round9_Result<71>1 (round9<71>)

LUT4_D:I3->O 3 0.612 0.454 d37/x40_15_mux00011 (d37/x40_15_mux0001)

LUT4_D:I3->O 1 0.612 0.360 d37/x40_19_mux00001 (d37/x40_19_mux0000)

LUT4:I3->O 1 0.612 0.000 d37/Mxor_rout<83>_xo<0>39 (imout9<83>)

FD:D 0.268 d38/x09_19

----------------------------------------


(57.5% logic, 42.5% route)

====================================================================

Timing constraint: Default OFFSET OUT AFTER for Clock 'clk'


-------------------------------------------------------------------------

Offset: 4.040ns (Levels of Logic = 1)

Source: deout_127 (FF)

Destination: deout<127> (PAD)


Data Path: deout_127 to deout<127>

Gate Net


---------------------------------------- ------------

FDE:C->Q 1 0.514 0.357 deout_127 (deout_127)

OBUF:I->O 3.169 deout_127_OBUF (deout<127>)

----------------------------------------


(91.2% logic, 8.8% route)

====================================================================

Total REAL time to Xst completion: 302.00 secs

Total CPU time to Xst completion: 301.67 secs

Total memory usage is 339772 kilobytes

Number of errors : 0 ( 0 filtered)

Number of warnings : 126 ( 0 filtered)

Number of infos : 32 ( 0 filtered)
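
The maximum clock frequencies in the two timing summaries are the reciprocals of the reported minimum periods. The sketch below redoes that conversion; the function name `max_frequency_mhz` is hypothetical. The results agree with the reported 150.280 MHz and 105.149 MHz to within about 0.01 MHz, and the small difference is presumably due to rounding of the printed period.

```python
def max_frequency_mhz(min_period_ns: float) -> float:
    """Maximum clock frequency in MHz for a minimum clock period in ns."""
    return 1000.0 / min_period_ns


print(round(max_frequency_mhz(6.654), 2))  # encryption design
print(round(max_frequency_mhz(9.510), 2))  # decryption design
```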


CHAPTER 5

SUMMARY

5.1 PROJECT SUMMARY:

The Advanced Encryption Standard (AES) is a security standard issued by NIST that became effective on May 26, 2002, to replace DES. The cryptographic scheme is a symmetric block cipher that encrypts and decrypts 128-bit blocks of data. The standard key lengths used by AES are 128, 192 and 256 bits.

Plain text refers to the data to be encrypted. Cipher text refers to the data after going through the cipher, as well as the data that will be going into the decipher. The state is an intermediate form of the cipher or decipher result, usually displayed as a rectangular table of bytes with 4 rows and 4 columns.

The first stage, the "SubBytes" transformation, is a non-linear byte substitution applied to each byte of the block. The second stage, the "ShiftRows" transformation, cyclically shifts (permutes) the bytes within the block. The third stage, the "MixColumns" transformation, groups 4 bytes together to form 4-term polynomials and multiplies the polynomials by a fixed polynomial modulo x^4 + 1. The fourth stage, the "AddRoundKey" transformation, adds the round key to the block of data. The decipher is simply the inverse of the cipher.
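
As an illustration of the third stage, the following sketch applies MixColumns to a single 4-byte column in software. It is not the VHDL used in this project; the helper names `xtime`, `gmul` and `mix_column` are hypothetical, and the byte arithmetic uses the AES field polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) from FIPS 197. The test column is the first column of the round 1 state in the FIPS 197 worked example.

```python
def xtime(b: int) -> int:
    """Multiply a byte by x (i.e. by 2) in GF(2^8), reducing by 0x11B."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF


def gmul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def mix_column(col):
    """Multiply one state column by the fixed MixColumns polynomial."""
    a0, a1, a2, a3 = col
    return [
        gmul(a0, 2) ^ gmul(a1, 3) ^ a2 ^ a3,
        a0 ^ gmul(a1, 2) ^ gmul(a2, 3) ^ a3,
        a0 ^ a1 ^ gmul(a2, 2) ^ gmul(a3, 3),
        gmul(a0, 3) ^ a1 ^ a2 ^ gmul(a3, 2),
    ]


print([f"{b:02x}" for b in mix_column([0xD4, 0xBF, 0x5D, 0x30])])
# ['04', '66', '81', 'e5']
```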

The algorithm consists of four stages that make up a round which is iterated 10

times for a 128-bit length key, 12 times for 192-bit key and 14 times for a 256-bit key.
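
The round counts follow the FIPS 197 relation Nr = Nk + 6, where Nk is the key length in 32-bit words. A minimal sketch, with a hypothetical function name `aes_rounds`:

```python
def aes_rounds(key_bits: int) -> int:
    """Number of AES rounds (Nr) for a given key length in bits."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key length must be 128, 192 or 256 bits")
    nk = key_bits // 32  # key length in 32-bit words
    return nk + 6


print([aes_rounds(bits) for bits in (128, 192, 256)])  # [10, 12, 14]
```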


5.2 CONCLUSION:

The main advantage of the Advanced Encryption Standard is that it maintains secret communication between the encrypting and decrypting parties. It is a symmetric-key encryption algorithm: the cipher key is the same for both the encryption and decryption processes, which reduces the complexity of encrypting and decrypting the data.

VHDL code is used to implement the encryption and decryption processes. Each program was tested with some of the sample vectors provided by NIST, and the outputs matched the expected results with minimal delay. For the 192- and 256-bit key variants, the algorithm takes a 128-bit plain text block and a 192- or 256-bit cipher key.

Understanding and correctly using AES can greatly increase the reliability and safety of software systems. AES can indeed be implemented with reasonable efficiency on an FPGA, with encryption and decryption taking an average of 320 and 340 ns respectively (for every 128 bits). The time varies from chip to chip, and the calculated delay can only be regarded as approximate. Adding data pipelines and some parallel combinational logic in the key scheduler and round calculator can further optimize this design.

There is currently no evidence that AES has any weakness that would make an attack other than exhaustive search feasible. Even AES-128 offers a sufficiently large number of possible keys to make an exhaustive search impractical for many decades, provided that no technological breakthrough causes the available computational power to increase dramatically and that theoretical research does not find a shortcut that bypasses the need for exhaustive search. There are many pitfalls to avoid when encryption is implemented and keys are generated.

It is therefore necessary to ensure the security of each and every implementation. A correctly implemented AES-128 is likely to protect against a million-dollar budget for at least 50-60 years and against individual budgets for at least another ten years.

5.3 SCOPE OF EXPANSION :

This algorithm can also be implemented with 192- and 256-bit keys.

This design can also be implemented as a crypto-processor for secret communication.

This algorithm can also be used to implement a crypto-processor for smartcards.


APPENDIX

REFERENCES:

The following diagram shows the values in the state array as the encryption progresses for a block length and a key length of 16 bytes each (i.e. Nb = 4 and Nk = 4) [1].

Input = 32 43 f6 a8 88 5a 30 8d 31 31 98 a2 e0 37 07 34

Cipher key = 2b 7e 15 16 28 ae d2 a6 ab f7 15 88 09 cf 4f 3c
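
The first step of the cipher, the initial AddRoundKey, can be reproduced directly from these two values. The sketch below XORs the input block, as given in FIPS 197 [1], with the cipher key; the names `PLAINTEXT`, `CIPHER_KEY` and `add_round_key` are hypothetical. The result is the state at the start of round 1 in the FIPS 197 example.

```python
# Example block and key as listed in FIPS 197 (Appendix B)
PLAINTEXT = bytes.fromhex("3243f6a8885a308d313198a2e0370734")
CIPHER_KEY = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")


def add_round_key(state: bytes, round_key: bytes) -> bytes:
    """XOR each state byte with the corresponding round key byte."""
    return bytes(s ^ k for s, k in zip(state, round_key))


initial_state = add_round_key(PLAINTEXT, CIPHER_KEY)
print(initial_state.hex(" "))
# 19 3d e3 be a0 f4 e2 2b 9a c6 8d 2a e9 f8 48 08
```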


[1] FIPS 197, "Advanced Encryption Standard (AES)", November 26, 2001
http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf

[2] J. Daemen and V. Rijmen, "AES Proposal: Rijndael", AES Algorithm Submission, September 3, 1999
http://www.esat.kuleuven.ac.be/~rijmen/rijndael/rijndaeldocV2.zip

[3] "FPGA Simulations of Round 2 Advanced Encryption Standards"
http://csrc.nist.gov/CryptoToolkit/aes/round2/conf3/presentations/elbirt.pdf

[4] http://en.wikipedia.org/wiki/Extended_Euclidean_algorithm

[5] Tilborg, Henk C. A. van, "Fundamentals of Cryptology: A Professional Reference and Interactive Tutorial", New York, Kluwer Academic Publishers, 2002

[6] Peter J. Ashenden, "The Designer's Guide to VHDL", 2nd Edition, San Francisco, CA, Morgan Kaufmann, 2002

References: understanding mix columns

[7] Wikipedia, Rijndael mix columns, [Online] Available: http://en.wikipedia.org/wiki/Rijndael_mix_columns

[8] William Stallings (2006), Chapter 4.6 Finite Fields of the Form GF(2^n), Multiplication, in Cryptography and Network Security: Principles and Practices, pages 125-126.

References: understanding inverse mix columns

[9] William Stallings (2006), Chapter 4.6 Finite Fields of the Form GF(2^n), Multiplication, in Cryptography and Network Security: Principles and Practices, pages 125-126.

[10] Kit Choy Xintong (2009), Understanding AES Mix-Columns Transformation Calculation, [Online] Available: http://sites.google.com/site/kitworldoftheory/Home/mixcolumns.pdf?attredirects=0

FPGA REFERENCES:

Initial Design for Spartan-3E Starter Kit (Reference Design)
http://www.xilinx.com/s3estarter

PowerTip PC1602-D Character LCD (Basic Electrical and Mechanical Data)
http://www.powertipusa.com/pdf/pc1602d.pdf

Sitronix ST7066U Character LCD Controller
http://www.sitronix.com.tw/sitronix/product.nsf/Doc/ST7066U?OpenDocument

Detailed Data Sheet on PowerTip Character LCD
http://www.rapidelectronics.co.uk/images/siteimg/57-0910e.PDF

Samsung S6A0069X Character LCD Controller
http://www.samsung.com/Products/Semiconductor/DisplayDriverIC/MobileDDI/BWSTN/S6A0069X/S6A0069X.htm