
Design and Implementation of Low Area and Low Power AES Encryption Hardware Core

Sep 05, 2014


Vaibhav Gupta
Page 1: Design and Implementation of Low Area and Low Power AES Encryption Hardware Core

DESIGN AND IMPLEMENTATION OF LOW AREA AND LOW POWER AES ENCRYPTION HARDWARE CORE

By

Group No.16

1. VAIBHAV GUPTA (0809131251)
2. VIJAY KUMAR VERMA (0809131095)
3. SHIVANI CHAURASIA (0809131082)

Under the guidance of

Mr. SAMPATH KUMAR

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING

JSS ACADEMY OF TECHNICAL EDUCATION C-20/1 SECTOR-62, NOIDA

April, 2012

Chapter 1

INTRODUCTION

In today’s digital world, encryption has become an integral part of all communication networks and information processing systems, protecting data both at rest and in transit. Encryption is the transformation of plain data (known as plaintext) into unintelligible data (known as ciphertext) through an algorithm referred to as a cipher. Numerous encryption algorithms are in common use, but the U.S. government has adopted the Advanced Encryption Standard (AES) for use by Federal departments and agencies in protecting sensitive information. The National Institute of Standards and Technology (NIST) has published the specification of this encryption standard in Federal Information Processing Standards (FIPS) Publication 197.

A conventional symmetric cipher such as AES uses a single key for both encryption and decryption, and this key is independent of the plaintext and of the cipher itself. It should be impractical to recover the plaintext from the ciphertext and the encryption algorithm alone, without knowing the encryption key. The secrecy of the encryption key is therefore of high importance in symmetric ciphers such as AES. A software implementation of an encryption algorithm cannot guarantee the secrecy of the key, since the operating system on which the encryption software runs is itself vulnerable to attacks.

Software implementations of encryption algorithms have other important drawbacks, including the lack of CPU instructions that operate on very large operands, word size mismatches on different operating systems, and limited parallelism. In addition, a software implementation may not reach the speed required by time-critical encryption applications. A hardware implementation is therefore an important alternative, since it offers better protection of the encryption key, higher speed, and greater efficiency through higher levels of parallelism.

Different versions of the AES algorithm exist today (AES128, AES192, and AES256), depending on the size of the encryption key. In this project, a hardware model implementing the AES128 algorithm was developed using the SystemVerilog hardware description language. A unique feature of the proposed design is that the round keys, which are consumed during the successive iterations of encryption, are generated in parallel with the encryption process.

The hardware model was then verified using a testbench that took advantage of SystemVerilog’s object-oriented programming (OOP) features by constructing random test objects and applying them to the model. Validation continued until the model reached a target level of Functional Coverage. The verified model was then synthesized using the Synopsys Design Compiler tool to estimate the gate count, area and timing of the hardware model.

In addition, the AES128 algorithm was modeled in the C language and ported to a Simics virtual system. Statistics gathered from the Simics virtual system were used to estimate the time needed to encrypt one plaintext block on that system.

Finally, the performance of the software and hardware implementations was compared.

Chapter 2

ADVANCED ENCRYPTION STANDARD (AES)

2.1 OVERVIEW

This chapter summarizes Federal Information Processing Standards (FIPS) Publication 197, issued by the National Institute of Standards and Technology (NIST), which specifies the Advanced Encryption Standard. Throughout the remainder of this chapter, the mathematical properties of the Advanced Encryption Standard (AES) are introduced using information obtained from the AES specification.

The AES is a subset of a larger encryption algorithm known as Rijndael, one of many proposals submitted to NIST in the competition to become a standard encryption algorithm. In October 2000, NIST announced the Rijndael algorithm as the winner on the basis of the best overall score in security, performance, efficiency, implementation capability and simplicity.

The AES algorithm is a symmetric cipher. In symmetric ciphers, a single secret key is used for both encryption and decryption, whereas asymmetric ciphers use a pair of keys known as the private and public keys. The plaintext is encrypted using the public key and can only be decrypted using the private key.

In addition, the AES algorithm is a block cipher, as it operates on fixed-length groups of bits (blocks). In stream ciphers, by contrast, the plaintext bits are encrypted one at a time, and the set of transformations applied to successive bits may vary during the encryption process.

The AES algorithm operates on blocks of 128 bits, using cipher keys with lengths of 128, 192 or 256 bits. Although the original Rijndael algorithm can process other block sizes and several other cipher key lengths, NIST did not adopt these additional features in the AES.

2.2 INPUTS, OUTPUTS AND THE STATE

The plaintext input and ciphertext output of the AES algorithm are blocks of 128 bits. The cipher key input is a sequence of 128, 192 or 256 bits. In other words, the length of the cipher key, Nk, is 4, 6 or 8 words, which represents the number of columns in the cipher key. The AES algorithm is categorized into three versions based on the cipher key length, and the number of encryption rounds for each version depends on the cipher key size.

In the AES algorithm, the number of rounds is represented by Nr, where Nr = 10 when Nk = 4, Nr = 12 when Nk = 6, and Nr = 14 when Nk = 8. Table 2.1 lists the variations of the AES algorithm. For the AES algorithm the block size (Nb), which represents the number of columns comprising the State, is Nb = 4.

TABLE 2.1-AES VARIATIONS

Version   Key length Nk (words)   Block size Nb (words)   Rounds Nr
AES128    4                       4                       10
AES192    6                       4                       12
AES256    8                       4                       14

The basic processing unit for the AES algorithm is a byte. As a result, the plaintext, ciphertext and cipher key are arranged and processed as arrays of bytes. For an input, an output or a cipher key denoted by a, the bytes in the resulting array are referenced as a_n, where n is in one of the following ranges:

Block length = 128 bits, 0 <= n < 16
Key length = 128 bits, 0 <= n < 16
Key length = 192 bits, 0 <= n < 24
Key length = 256 bits, 0 <= n < 32

All byte values in the AES algorithm are presented as the concatenation of their individual bit values between braces in the order {b7, b6, b5, b4, b3, b2, b1, b0}. These bytes are interpreted as finite field elements using a polynomial representation:

\[ b_7 x^7 + b_6 x^6 + b_5 x^5 + b_4 x^4 + b_3 x^3 + b_2 x^2 + b_1 x + b_0 = \sum_{i=0}^{7} b_i x^i \]

As an example, {10001001} (or {89} in hexadecimal) identifies the polynomial

\[ x^7 + x^3 + 1 \]

The arrays of bytes in the AES algorithm are represented as $a_0, a_1, a_2, \ldots, a_{15}$.
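The byte-to-polynomial interpretation can be illustrated with a short Python sketch. The function name `byte_to_poly` is hypothetical and chosen only for this illustration; it lists the powers of x whose coefficient bit is set.

```python
def byte_to_poly(value: int) -> str:
    """Return the GF(2^8) polynomial for a byte, highest power first."""
    terms = []
    for i in range(7, -1, -1):          # bit b7 down to bit b0
        if (value >> i) & 1:
            if i == 0:
                terms.append("1")
            elif i == 1:
                terms.append("x")
            else:
                terms.append(f"x^{i}")
    return " + ".join(terms) if terms else "0"


if __name__ == "__main__":
    print(byte_to_poly(0b10001001))
    print(byte_to_poly(0x63))
```

For the bit pattern {10001001}, bits 7, 3 and 0 are set, so the sketch lists the terms x^7, x^3 and 1.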

All AES operations are performed on a two-dimensional 4x4 array of bytes called the State. An individual byte within the State is referred to as s[r,c], where ‘r’ denotes the row and ‘c’ denotes the column. At the beginning of the encryption process, the State is populated with the plaintext. The cipher then performs a set of substitutions and permutations on the State. After the cipher operations have been carried out, the final value of the State is copied to the ciphertext output, as shown in the following figure.

Figure 2.2 – State Population and Results

At the beginning of the cipher, the input array is copied into the State according to the following scheme:

s[r,c] = in[r + 4c] for 0 <= r < 4 and 0 <= c < 4,

and at the end of the cipher the State is copied into the output array as shown below:

out[r + 4c] = s[r,c] for 0 <= r < 4 and 0 <= c < 4.
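The column-major mapping between the 16-byte block and the 4x4 State can be sketched in Python as follows. The function names `bytes_to_state` and `state_to_bytes` are hypothetical, and the State is modeled here as a list of four rows.

```python
def bytes_to_state(block: bytes) -> list[list[int]]:
    """Copy a 16-byte input into the State: s[r][c] = in[r + 4c]."""
    if len(block) != 16:
        raise ValueError("AES operates on 16-byte blocks")
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]


def state_to_bytes(state: list[list[int]]) -> bytes:
    """Copy the State to the output: out[r + 4c] = s[r][c]."""
    return bytes(state[r][c] for c in range(4) for r in range(4))


if __name__ == "__main__":
    state = bytes_to_state(bytes(range(16)))
    for row in state:
        print(row)
```

Consecutive input bytes fill the State column by column, so byte 9 of the input lands in row 1, column 2.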

2.3 Cipher Transformations

The AES cipher operates either on individual bytes of the State or on an entire row or column. At the start of the cipher, the input is copied into the State as described in Section 2.2. Then, an initial Round Key addition is performed on the State. Round keys are derived from the cipher key using the Key Expansion routine, which generates a round key for each round of transformations performed on the State.

The transformations performed on the State are the same in all AES versions, but the number of transformation rounds depends on the cipher key length. The final round in all AES versions differs slightly from the first Nr - 1 rounds, as it has one fewer transformation performed on the State. Each round of the AES cipher (except the last one) consists of all of the following transformations:

- SubBytes( )

- ShiftRows( )

- MixColumns( )

- AddRoundKey ( )

The AES cipher is described in pseudo code in Figure 2.3. As shown in the pseudo code, all Nr rounds are identical with the exception of the final round, which does not include the MixColumns transformation. The array w[] represents the round keys generated by the key expansion routine. In the following sections, the individual transformations used in each encryption round are described.

Figure 2.3 – AES Cipher

2.3.1 SubBytes( ) Transformation

SubBytes is a byte substitution operation performed on the individual bytes of the State, as shown in Figure 2.4, using a substitution table called the S-box.

Figure 2.4 – SubBytes Transformation

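The invertible S-box table is constructed by applying the following two transformations to each possible byte value:

- Take the multiplicative inverse of the byte in the finite field GF(2^8); the byte {00} is mapped to itself.
- Apply the following affine transformation over GF(2) to the result:

\[ b'_i = b_i \oplus b_{(i+4) \bmod 8} \oplus b_{(i+5) \bmod 8} \oplus b_{(i+6) \bmod 8} \oplus b_{(i+7) \bmod 8} \oplus c_i, \quad 0 \le i < 8 \]

Here b_i is the ith bit of the byte and c_i is the ith bit of a constant byte with the value {63}. The affine transformation can be expressed in matrix form as shown below:

\[
\begin{bmatrix} b'_0 \\ b'_1 \\ b'_2 \\ b'_3 \\ b'_4 \\ b'_5 \\ b'_6 \\ b'_7 \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{bmatrix}
\oplus
\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}
\]

The S-box shown in Table 2.5 is constructed by performing the two transformations described above for all possible byte values, ranging from {00} to {ff}. For example, the substitution value for {53} is found at the intersection of the row with index ‘5’ and the column with index ‘3’.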

Table 2.5 – AES S-box
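The two-step S-box construction can be reproduced with a minimal Python sketch. The helper names (`gf_mul`, `gf_inverse`, `affine`) are hypothetical, and the inverse is found by brute-force search for clarity rather than speed; the project's hardware model may well use a lookup table instead.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        # Multiply a by x and reduce when the x^8 term appears.
        a = ((a << 1) ^ (0x11B if a & 0x80 else 0)) & 0xFF
        b >>= 1
    return result


def gf_inverse(a: int) -> int:
    """Multiplicative inverse in GF(2^8); {00} is mapped to itself."""
    if a == 0:
        return 0
    return next(c for c in range(1, 256) if gf_mul(a, c) == 1)


def affine(b: int) -> int:
    """b'_i = b_i ^ b_(i+4) ^ b_(i+5) ^ b_(i+6) ^ b_(i+7) ^ c_i with c = {63}."""
    out = 0
    for i in range(8):
        bit = (b >> i) & 1
        for offset in (4, 5, 6, 7):
            bit ^= (b >> ((i + offset) % 8)) & 1
        bit ^= (0x63 >> i) & 1
        out |= bit << i
    return out


SBOX = [affine(gf_inverse(x)) for x in range(256)]

if __name__ == "__main__":
    # Row index '5', column index '3' of the S-box table.
    print(f"{SBOX[0x53]:02x}")
```

Indexing `SBOX[0x53]` corresponds to the row '5', column '3' lookup described in the text.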

2.3.2 ShiftRows( ) Transformation

The ShiftRows transformation cyclically shifts the last three rows of the State by different offsets. The first row is left unchanged in this transformation. Each byte of the second row is shifted one position to the left. The third and fourth rows are shifted left by two and three positions, respectively. The ShiftRows transformation is illustrated in Figure 2.6.

Figure 2.6 – ShiftRows Transformation
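As a minimal illustration of the row offsets, the following Python sketch rotates row r of the State left by r positions. The function name `shift_rows` and the list-of-rows State layout are assumptions made for this example.

```python
def shift_rows(state: list[list[int]]) -> list[list[int]]:
    """Cyclically shift row r of the 4x4 State left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


if __name__ == "__main__":
    state = [
        [0, 1, 2, 3],
        [10, 11, 12, 13],
        [20, 21, 22, 23],
        [30, 31, 32, 33],
    ]
    for row in shift_rows(state):
        print(row)
```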

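2.3.3 MixColumns( ) Transformation

This transformation operates on the columns of the State, treating each column as a four-term polynomial over the finite field GF(2^8). Each column is multiplied modulo x^4 + 1 with a fixed four-term polynomial a(x) = {03}x^3 + {01}x^2 + {01}x + {02} over GF(2^8). The MixColumns transformation can be expressed as a matrix multiplication as shown below:

\[
\begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix}
=
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix}
\quad \text{for } 0 \le c < Nb
\]

The MixColumns transformation replaces the four bytes of the processed column with the following values:

\[
\begin{aligned}
s'_{0,c} &= (\{02\} \bullet s_{0,c}) \oplus (\{03\} \bullet s_{1,c}) \oplus s_{2,c} \oplus s_{3,c} \\
s'_{1,c} &= s_{0,c} \oplus (\{02\} \bullet s_{1,c}) \oplus (\{03\} \bullet s_{2,c}) \oplus s_{3,c} \\
s'_{2,c} &= s_{0,c} \oplus s_{1,c} \oplus (\{02\} \bullet s_{2,c}) \oplus (\{03\} \bullet s_{3,c}) \\
s'_{3,c} &= (\{03\} \bullet s_{0,c}) \oplus s_{1,c} \oplus s_{2,c} \oplus (\{02\} \bullet s_{3,c})
\end{aligned}
\]

The “ • ” corresponds to the multiplication of polynomials in GF(2^8) modulo an irreducible polynomial of degree 8. A polynomial is irreducible if its only divisors are one and itself. For the AES algorithm the irreducible polynomial is:

\[ m(x) = x^8 + x^4 + x^3 + x + 1 \]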

The MixColumns transformation is illustrated in Figure 2.7. This transformation, together with ShiftRows, provides substantial diffusion in the cipher, meaning that the result of the cipher depends on the cipher inputs in a very complex way. In other words, in a cipher with good diffusion, a single bit change in the plaintext will completely change the ciphertext in an unpredictable manner.

Figure 2.7 – MixColumns Transformation
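The MixColumns arithmetic for one column can be sketched in Python as below. The helper names `xtime` and `mix_column` are hypothetical; multiplication by {02} is a left shift followed by a conditional reduction by the AES irreducible polynomial (0x11B), and multiplication by {03} is computed as {02}•b XOR b.

```python
def xtime(b: int) -> int:
    """Multiply a byte by {02} in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b


def mul3(b: int) -> int:
    """Multiply a byte by {03}: {02}*b XOR b."""
    return xtime(b) ^ b


def mix_column(col: list[int]) -> list[int]:
    """Apply the MixColumns matrix to one 4-byte column [s0, s1, s2, s3]."""
    s0, s1, s2, s3 = col
    return [
        xtime(s0) ^ mul3(s1) ^ s2 ^ s3,
        s0 ^ xtime(s1) ^ mul3(s2) ^ s3,
        s0 ^ s1 ^ xtime(s2) ^ mul3(s3),
        mul3(s0) ^ s1 ^ s2 ^ xtime(s3),
    ]


if __name__ == "__main__":
    print([f"{b:02x}" for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

In hardware, the same structure maps to a shift, a conditional XOR with {1b}, and a few further XOR gates per byte, which is why MixColumns needs no general multiplier.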

2.3.4 AddRoundKey ( ) Transformation

During the AddRoundKey transformation, the round key values are added to the State by means of a simple Exclusive Or (XOR) operation. Each round key consists of Nb words generated by the KeyExpansion routine. The round key values are added to the columns of the State in the following way:

\[ [s'_{0,c}, s'_{1,c}, s'_{2,c}, s'_{3,c}] = [s_{0,c}, s_{1,c}, s_{2,c}, s_{3,c}] \oplus [w_{round \cdot Nb + c}] \quad \text{for } 0 \le c < Nb \]

In the equation above, the round value lies in the range 0 <= round <= Nr. When round = 0, the cipher key itself is used as the round key, which corresponds to the initial AddRoundKey transformation in the pseudo code in Figure 2.3. The AddRoundKey transformation is illustrated in Figure 2.8.

Figure 2.8 – AddRoundKey Transformation
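The column-wise XOR can be sketched in Python as follows. The function name `add_round_key` is hypothetical; each round key word is modeled as a 32-bit integer whose most significant byte is added to row 0 of the corresponding column.

```python
def add_round_key(state: list[list[int]], round_words: list[int]) -> list[list[int]]:
    """XOR column c of the State with the 32-bit round key word round_words[c].

    round_words holds the four words w[round*Nb + c] for c = 0..3 (Nb = 4).
    """
    return [
        [state[r][c] ^ ((round_words[c] >> (24 - 8 * r)) & 0xFF) for c in range(4)]
        for r in range(4)
    ]


if __name__ == "__main__":
    zero_state = [[0] * 4 for _ in range(4)]
    words = [0x00010203, 0x04050607, 0x08090A0B, 0x0C0D0E0F]
    for row in add_round_key(zero_state, words):
        print(row)
```

Because XOR is its own inverse, applying the same round key twice restores the original State.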

2.4 AES KEY EXPANSION

The AES algorithm requires four words of round key for each encryption round. That is a total of 4*(Nr + 1) round key words, counting the initial set required for the first AddRoundKey transformation. All the round keys are derived from the cipher key itself.

According to Federal Information Processing Standards (FIPS) Publication 197, there is no restriction on cipher key selection, as no weak cipher key has been identified for the AES algorithm. The expansion of the cipher key into the round keys is performed by the KeyExpansion algorithm, as shown in the pseudo code in Figure 2.9.

Figure 2.9 – KeyExpansion Algorithm

In the above pseudo code, the array w[] represents the round keys generated by the KeyExpansion routine and Nk represents the size of the cipher key in words. Depending on the version of the AES algorithm, Nk = 4, 6 or 8. The first Nk words of the expanded key are filled with the cipher key.

The SubWord( ) function applies the S-box substitution to each of the four bytes in a word. The RotWord( ) function takes a word [a0,a1,a2,a3] as input, performs a cyclic shift and returns the word [a1,a2,a3,a0]. The round constant word array, Rcon[i], contains the 32-bit value [{02}^(i-1),{00},{00},{00}], where the power is computed in GF(2^8).

Every subsequent word of the expanded key, w[i], is equal to the XOR of the previous word, w[i-1], and the word Nk positions earlier, w[i-Nk]. For words in positions that are a multiple of Nk, a transformation is applied to w[i-1] before this XOR: a cyclic shift of its bytes (RotWord), followed by an S-box table lookup on all four bytes (SubWord), followed by an XOR with the round constant Rcon[i/Nk].

The KeyExpansion routine for AES256 (Nk = 8) differs slightly from that for AES128 and AES192: when i - 4 is a multiple of Nk, SubWord( ) is additionally applied to w[i-1] prior to the XOR with w[i-Nk].
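The key expansion for the AES128 case (Nk = 4, Nr = 10) can be sketched in Python as follows. This is an illustrative software model rather than the project's SystemVerilog code; the helper names are hypothetical, and the S-box is generated on the fly (brute-force inverse plus affine transformation) so that the example is self-contained.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a = ((a << 1) ^ (0x11B if a & 0x80 else 0)) & 0xFF
        b >>= 1
    return result


def rotl8(v: int, n: int) -> int:
    """Rotate a byte left by n bits."""
    return ((v << n) | (v >> (8 - n))) & 0xFF


def sbox_entry(x: int) -> int:
    """Multiplicative inverse ({00} maps to itself) followed by the affine step."""
    inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]


def sub_word(word: int) -> int:
    """Apply the S-box to each of the four bytes of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")


def rot_word(word: int) -> int:
    """[a0,a1,a2,a3] -> [a1,a2,a3,a0]."""
    return ((word << 8) | (word >> 24)) & 0xFFFFFFFF


def key_expansion_128(key: bytes) -> list[int]:
    """Return the 44 round key words w[0..43] for a 16-byte cipher key."""
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    rcon = 0x01                                  # {02}^(i/Nk - 1), starting at {01}
    for i in range(4, 44):
        temp = w[i - 1]
        if i % 4 == 0:
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 0x02)            # next power of {02}
        w.append(w[i - 4] ^ temp)
    return w


if __name__ == "__main__":
    words = key_expansion_128(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
    print(f"w[4] = {words[4]:08x}, w[43] = {words[43]:08x}")
```

The key used in the usage example is the AES128 example key from FIPS Publication 197, so the printed words can be checked against the key expansion listing in that document.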

CHAPTER 3

AES128 DESIGN AND IMPLEMENTATION

3.1 OVERVIEW

In this chapter, a hardware model implementing the AES128 algorithm is introduced. The model is implemented in the SystemVerilog hardware description language. This chapter covers the design and implementation issues of the AES128 algorithm. In the next chapter, a test infrastructure is presented that thoroughly tests the functionality of the implemented model. The hardware model developed in this chapter is synthesizable, meaning that it provides a cycle-by-cycle RTL description of the circuit that a logic synthesis tool can convert to an optimized gate-level netlist.

The modeling process used in this project is the bottom-up approach: the leaf components in the design hierarchy were developed first, and the higher-level modules were constructed by instantiating their subcomponents and connecting them with internal signals. All the modules in the design hierarchy were modeled in behavioral style, and the root module additionally uses data flow modeling to implement the four major cipher transformations.

3.2 DESIGN HIERARCHY

The proposed AES128 hardware model is a 3-level hierarchical design, as shown in Figure 3.1. The root module in the hierarchy is AES128_cipher_top. This module implements the AES128 pseudo code displayed in Figure 2.3. It has two 128-bit inputs for receiving the cipher key and the plaintext. There is also a single-bit input signal, ‘Ld’, which indicates the availability of a new plaintext or cipher key on the input ports. The completion of the encryption process is indicated by asserting the single-bit ‘done’ output.

Figure 3.1 – Design Hierarchy

A unique feature of the proposed design is that the AES128_Key_Expand module is pipelined with the AES128_cipher_top module. While the AES128_cipher_top module performs an iteration of the encryption transformations on the State using the previously generated round key, the AES128_Key_Expand module produces the round key to be used by the root module in the next encryption iteration.

3.2.1 AES128 Encryption Process

The AES128_cipher_top module state diagram is shown in Figure 3.2. There are ten rounds of transformations, represented by the states r1 to r10. The four cipher transformations introduced in Section 2.3 are applied in each of these states, except that r10, the final round, omits MixColumns as described in Section 2.3. The r0 state corresponds to the initial AddRoundKey transformation in Figure 2.3.

After leaving the Reset state, the AES128_Cipher_Top module waits for assertion of the ‘Ld’ signal, which indicates that a valid plaintext and cipher key are available on the input ports. After reaching the r0 state, there is a transition on every clock cycle for the next ten cycles, as ten rounds of encryption are applied to the State.

After ten rounds of transformations, the ‘done’ signal is asserted to indicate completion of the cipher and availability of the ciphertext on the corresponding output port.

Figure 3.2 – AES128_Cipher_Top Module State Diagram
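The control sequence described above can be modeled with a small Python sketch. This is only a behavioral illustration under stated assumptions: the state names "reset" and "idle", the return to "idle" after r10, and the assertion of ‘done’ in r10 are hypothetical choices, since the state diagram itself is not reproduced in the text.

```python
ROUND_STATES = [f"r{i}" for i in range(11)]      # r0 (initial AddRoundKey) .. r10


def next_state(state: str, ld: bool) -> str:
    """One clock-edge transition of the assumed cipher control FSM."""
    if state == "reset":
        return "idle"                            # assumption: leave reset into a wait state
    if state == "idle":
        return "r0" if ld else "idle"            # wait for 'Ld' to be asserted
    if state == "r10":
        return "idle"                            # assumption: wait for the next block
    return ROUND_STATES[ROUND_STATES.index(state) + 1]


def done(state: str) -> bool:
    """Assumption: 'done' is asserted once the tenth round state is reached."""
    return state == "r10"


def run(ld_sequence: list[bool]) -> list[str]:
    """Return the sequence of states visited for a per-cycle 'Ld' input."""
    state = "reset"
    trace = [state]
    for ld in ld_sequence:
        state = next_state(state, ld)
        trace.append(state)
    return trace


if __name__ == "__main__":
    print(run([False, False, True] + [False] * 10))
```

Once r0 is reached, the model advances one round state per clock cycle for ten cycles, matching the ten rounds of encryption.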

3.2.2 AES128 Round Key Generation

The round keys used by the AES128_Cipher_Top module are generated by the AES128_Key_Expand and AES128_RCon modules. These two modules operate according to the state diagram shown in Figure 3.3, which differs slightly from the one used for the encryption process.

Figure 3.3 – AES128_Key_Expand Module State Diagram

In the state diagram shown above, the ‘Ld’ signal is checked in the ‘r0’ state. If it is asserted, the cipher key is provided to the AES128_Cipher_Top module for use in the initial AddRoundKey transformation.

The AES128_Key_Expand module generates four 32-bit round key words for each round of the encryption process from the cipher key. Figure 3.4 shows the block diagram of the AES128_Key_Expand module. The cipher key is passed to this module through a 128-bit input port, and the round keys are generated on the four output ports.

Figure 3.4 – AES128_Key_Expand Module

A 32-bit round constant is used by the key expansion algorithm to generate the round keys. This value varies for each encryption round and, for round i = 1 to 10, is given by [{02}^(i-1),{00},{00},{00}]. The AES128_RCon module generates this value, as shown in Figure 3.5. The AES128_RCon module also operates according to the state diagram shown in Figure 3.3.

Figure 3.5 – AES128_Rcon Module

3.3 AES128 PIPELINED DESIGN

As stated earlier in this chapter, round key generation in the proposed design is pipelined with the encryption rounds. The pipelined operation of the round key expansion and the cipher is shown in Figure 3.6. Each AES encryption round ‘n’ (white cells) is pipelined with the key generation for round ‘n+1’ (gray cells).

Figure 3.6 – AES128 Pipelined Round Key Generation and Cipher Rounds

The most important advantage of the pipelined design is the lower delay of each encryption iteration, since the round key for each iteration is already available at the beginning of the iteration cycle. The lower delay per iteration means faster completion of each round of encryption. This reduces the overall encryption delay and allows the design to operate at higher clock frequencies. A higher clock frequency increases the message encryption rate (throughput), making this design suitable for time-critical encryption applications.
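The overlap between round n and the key generation for round n+1 can be mimicked in a sequential Python model of AES128. This is a software sketch only, not the project's SystemVerilog design: all function names are hypothetical, the State is a flat list of 16 bytes with index r + 4c, and the "parallel" key generation is simply performed in the same loop iteration as the round that precedes its use.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a = ((a << 1) ^ (0x11B if a & 0x80 else 0)) & 0xFF
        b >>= 1
    return result


def rotl8(v: int, n: int) -> int:
    return ((v << n) | (v >> (8 - n))) & 0xFF


def sbox_entry(x: int) -> int:
    """Multiplicative inverse ({00} maps to itself) followed by the affine step."""
    inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]


def next_round_key(key: list[int], rcon: int) -> list[int]:
    """Derive the 16-byte key of round n+1 from the key of round n (Nk = 4)."""
    # SubWord(RotWord(last word)) XOR Rcon, as four bytes.
    t = [SBOX[key[13]] ^ rcon, SBOX[key[14]], SBOX[key[15]], SBOX[key[12]]]
    out: list[int] = []
    for i in range(16):
        prev = t[i] if i < 4 else out[i - 4]
        out.append(key[i] ^ prev)
    return out


def encrypt_round(state: list[int], round_key: list[int], last: bool) -> list[int]:
    s = [SBOX[b] for b in state]                              # SubBytes
    s = [s[(i + 4 * (i % 4)) % 16] for i in range(16)]        # ShiftRows
    if not last:                                              # MixColumns
        mixed: list[int] = []
        for c in range(4):
            a = s[4 * c:4 * c + 4]
            mixed += [gf_mul(a[r], 2) ^ gf_mul(a[(r + 1) % 4], 3)
                      ^ a[(r + 2) % 4] ^ a[(r + 3) % 4] for r in range(4)]
        s = mixed
    return [b ^ k for b, k in zip(s, round_key)]              # AddRoundKey


def encrypt_block(plaintext: bytes, key: bytes) -> bytes:
    round_key = list(key)
    state = [p ^ k for p, k in zip(plaintext, round_key)]     # r0: initial AddRoundKey
    rcon = 0x01
    next_key = next_round_key(round_key, rcon)                # key for round 1, made during r0
    for rnd in range(1, 11):
        round_key = next_key
        if rnd < 10:
            rcon = gf_mul(rcon, 0x02)
            next_key = next_round_key(round_key, rcon)        # round rnd+1 key, "alongside" round rnd
        state = encrypt_round(state, round_key, last=(rnd == 10))
    return bytes(state)


if __name__ == "__main__":
    ct = encrypt_block(bytes.fromhex("00112233445566778899aabbccddeeff"),
                       bytes.fromhex("000102030405060708090a0b0c0d0e0f"))
    print(ct.hex())
```

Only one round key is held at a time in this model, which also reflects an area-related side effect of on-the-fly key generation: the full expanded key schedule never needs to be stored.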

CHAPTER 4

AES128 VERIFICATION

4.1 OVERVIEW

In this chapter, we describe the test infrastructure developed in SystemVerilog to verify the functionality of the model described in the previous chapter. The simulation was done using the Synopsys VCS tool. The testbench validated the design by constructing random cyclic test vectors for the plaintext and the cipher key, passing them to the model, and comparing the resulting ciphertext with the expected result.

4.2 Testbench Infrastructure

Verifying a design using an HDL involves four major steps: generating test vectors, passing the test vectors to the design and capturing its response, determining correctness by comparing the response with the expected results, and measuring the verification coverage. The test infrastructure described in this chapter performs all of these steps in a systematic way.

The AES128 test infrastructure contains several components, some of which are features unique to SystemVerilog. These features make the verification of a design more reliable and more structured. The test infrastructure components are displayed in Figure 4.1 as part of the AES128_Top module.

Figure 4.1 – AES128 Test Infrastructure

The test infrastructure uses the SystemVerilog program block, which has multiple implicit timing regions so that design events are evaluated separately from testbench events. The program block is connected to the model through another feature unique to SystemVerilog, called the Interface.

The Interface bundles the connections between the testbench and the design while enforcing the synchronization and communication protocol between the two entities. The definition of the AES128_Top module in SystemVerilog is shown in Figure 4.2, which contains the high-level instantiation of the modules that make up the test infrastructure.

Figure 4.2 – AES128_Top Definition

The AES128_Top module instantiates the design, the Interface and the Program. The Interface and Program constructs are discussed in the next two sections. The clock generator is also defined inside the AES128_Top module, to avoid any potential race conditions.

4.3 AES128_Interface

As designs become more complex, the number of module ports and the complexity of the interconnections between modules also increase. The SystemVerilog Interface construct addresses this problem, as it provides a structured means of communication between several modules.

The Interface bundles the ports together and enforces synchronization between the modules connected through it. An Interface can provide connectivity between design modules and/or the testbench. The modport construct is used in an Interface to specify the direction of the signals that are bundled together and to group the signals that are synchronous to a specific clock. In this project, the SystemVerilog Interface was used only to connect the high-level design with the testbench, as shown in Figure 4.1. As a result, two modports were declared for the Interface.

In an Interface, the signals that are synchronous to a clock are defined inside a Clocking Block to ensure correct timing between the testbench and the high-level design. This ensures that every synchronous signal is driven or sampled with respect to the clock, and eliminates the potential race condition that exists between a testbench and a high-level design written in Verilog. The AES128_Interface definition is shown in Figure 4.3.

Figure 4.3 – AES128_Interface Definition

4.4 AES128_Program

In Verilog, a testbench is simply another module connected to the high-level design, which can cause a race condition between the testbench and the design. The SystemVerilog hardware description language introduces a new construct called Program to be used as the testbench. “The SystemVerilog Program, having one (or more) entry points, is closer to a program in C, than Verilog’s many small blocks of concurrently executing hardware”. It also has multiple implicit timing regions that evaluate the design events separately from the testbench events, eliminating race conditions between the design under test and the testbench.

The testbench described in this chapter consists of a single Program, which uses the object-oriented programming features of SystemVerilog to dynamically build random test vectors. This is done by defining a Class inside the AES128_Program that encapsulates two random cyclic variables (Properties) for generating stimulus for the high-level design. The class defined in the AES128_Program is shown in Figure 4.4.

As stated earlier in this chapter, another important function of a testbench is keeping track of the verification coverage. In other words, to make sure that a design is thoroughly verified, the testbench needs to exercise all the design features. “Functional Coverage is a measure of which design features have been exercised by the test”.

Functional Coverage is measured by means of Cover Groups defined inside the SystemVerilog Program. Each Cover Group consists of multiple Cover Points, which are the variables used for generating stimulus for the design under test. As shown in Figure 4.4, the class defined in the AES128_Program uses a single Cover Group to keep track of the 128-bit plain_text and cipher_key stimuli. Because the Synopsys VCS compiler limits cyclic random objects to no more than 16 bits, the 128-bit stimuli are broken into arrays of 16-bit elements. Each array element is declared as a Cover Point inside the Cover Group, and all elements are sampled together to measure the Functional Coverage.
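The decomposition of a 128-bit stimulus into 16-bit elements can be illustrated with a short Python sketch. The names below are hypothetical, the ordering of the elements (most significant first) is an assumption, and plain uniform random draws are used; the cyclic (non-repeating) behavior of the SystemVerilog random variables is not modeled.

```python
import random

CHUNK_BITS = 16
CHUNKS_PER_VECTOR = 128 // CHUNK_BITS            # eight 16-bit elements


def random_chunks(rng: random.Random) -> list[int]:
    """Draw eight independent 16-bit elements (not cyclic, unlike the testbench)."""
    return [rng.getrandbits(CHUNK_BITS) for _ in range(CHUNKS_PER_VECTOR)]


def pack_chunks(chunks: list[int]) -> int:
    """Concatenate the elements into one 128-bit value, first element most significant."""
    value = 0
    for chunk in chunks:
        value = (value << CHUNK_BITS) | chunk
    return value


def unpack_chunks(value: int) -> list[int]:
    """Split a 128-bit value back into its eight 16-bit elements."""
    mask = (1 << CHUNK_BITS) - 1
    return [(value >> (CHUNK_BITS * i)) & mask
            for i in reversed(range(CHUNKS_PER_VECTOR))]


if __name__ == "__main__":
    rng = random.Random(2012)                    # fixed seed for repeatability
    chunks = random_chunks(rng)
    print(f"{pack_chunks(chunks):032x}")
```

Each 16-bit element would correspond to one Cover Point in the Cover Group described above.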

Figure 4.4 – Class Definition in the AES128_Program

The AES128_Program pseudo code is shown in Figure 4.5. This testbench verifies the design until the Functional Coverage reaches 100%. The verification procedure involves generating the stimuli, passing them through the AES128_Interface to the design under test, and verifying the correctness of the results obtained from the design.

Figure 4.5 – AES128_Program Pseudo Code

To verify the correct functionality of the design under test, a C-style function was developed in SystemVerilog that takes the stimuli as input and calculates the expected ciphertext. This function is defined as part of a package that contains all the variables and routines involved in the encryption process, as shown in Figure 4.6.

Figure 4.6 – AES128_Testbench_Package pseudo code

The complete simulation result of the testbench is included in Appendix C. Figure 4.7 illustrates the simulation result for the first three test cases. Each test case starts by randomizing the cover points to populate the plaintext and cipher key inputs of the design under test. Then, the expected ciphertext is calculated using the AES128_cipher function shown in Figure 4.6. After the design under test has encrypted the plaintext and the “done” signal is asserted, the ciphertext generated by the hardware model is compared with the expected result to catch any mismatch. The last step in each test case is gathering the Functional Coverage; testing then continues with the next test case until all design features have been tested.

Figure 4.7 – Sample Simulation Results

CHAPTER 5

AES128 SYNTHESIS

5.1 Overview

A primary objective of this project was to develop a synthesizable model of the AES128 encryption algorithm. Synthesis is the process of converting the register transfer level (RTL) representation of a design into an optimized gate-level netlist. This is a major step in the ASIC design flow that takes an RTL model closer to a low-level hardware implementation.

Synthesis consists of three main steps. The first step is “translation”, which converts the RTL description of a design into a non-optimized intermediate representation used by the synthesis tool. The second step is “logic optimization”, which optimizes the internal representation by removing redundant logic and performing Boolean logic optimizations. The third step is “technology mapping and optimization”, which maps the internal representation to an optimized gate-level representation using the technology library cells, based on the design constraints. [3]

In this chapter, we describe how the Synopsys Design Compiler tool was used to synthesize the verified AES128 model, using a script developed to perform the synthesis under certain constraints. The script generates several reports about the synthesis outcome, including timing and area estimates.

5.2 Synthesis Methodology

The first step in the synthesis process is to read all the components in the design hierarchy. There are three components in the 3-level design hierarchy that need to be synthesized. Since the RTL model uses a SystemVerilog “Package”, the synthesis tool needs to enable package semantics. In addition, the synthesis tool needs to know whether an automatic function is called from multiple places in the design, so that separate values are preserved for each instance.

The following Synopsys Design Compiler (DC) shell commands enable the use of packages and automatic functions:

set hdlin_sv_packages "enable"
set hdlin_infer_function_local_latches "true"

Then, the package and the modules in the design hierarchy are read using the following commands:

read_file -format sverilog {./AES128_DUT_package.sv}
read_file -format sverilog {./AES128_rcon.sv}
read_file -format sverilog {./AES128_key_expand.sv}
read_file -format sverilog {./AES128_cipher_top.sv}

After reading the design files, they are “Analyzed” and “Elaborated”, through which the RTL code is converted into the Synopsys Design Compiler internal format. The intermediate results are stored in the defined “working library”. The following DC commands are used for these steps:

analyze -library WORK -format sverilog {./AES128_rcon.sv}
analyze -library WORK -format sverilog {./AES128_key_expand.sv}
analyze -library WORK -format sverilog {./AES128_cipher_top.sv}

elaborate AES128_rcon -architecture verilog -library WORK
elaborate AES128_key_expand -architecture verilog -library WORK
elaborate AES128_cipher_top -architecture verilog -library WORK

Then, the “dont_touch” attribute is removed from all the modules in the design hierarchy so that during the optimization phase the tool can modify the modules. The following DC command is used for this step:

remove_attribute [find design -hierarchy] dont_touch

After this step, a 40MHz clock signal (25ns period) is applied to the clock port of the root module, and the synthesis tool is instructed not to modify the clock tree during the optimization phase. In addition, an arbitrary delay of 5ns with respect to the clock port is applied to all input and output ports (except the clock port itself) to set a safe margin against any unintended source of delay, such as the delay associated with the driving module or modules.

Then, the design is constrained with a hypothetical maximum area equal to zero to force the tool to make the gate level netlist as compact as possible. The following DC commands are used for these steps:

create_clock -name clk -period 25 [find port intf_clk]
set_dont_touch_network [find clock "clk"]

set non_clock_ports [remove_from_collection [all_inputs] [get_ports intf_clk]]

set_input_delay 5 $non_clock_ports -clock clk
set_output_delay 5 [all_outputs]

set_max_area 0

In the next steps, the tool is instructed to consider a unique design for each cell instance by removing the multiply-instantiated hierarchy in the current design. Then, the synthesis script removes the boundaries from all the components in the design hierarchy and removes all levels of hierarchy.

uniquify
set_boundary_optimization [find design -hierarchy] true
ungroup -all -flatten -all_instances

Finally, the tool compiles the design with high effort and reports any warning related to the mapping and final optimization step. At the end, the tool generates reports for the optimized gate level netlist area, the worst combinational path timing, and any violated design constraint.

report_attribute > ./Synthesis_Reports_Attribute.txt
report_area > ./Synthesis_Reports_Area.txt

report_constraints -all_violators > ./Synthesis_Reports_Constraint_Violaters.txt

report_timing -path full -delay max -max_paths 1 -nworst 1 > ./Synthesis_Reports_Timing.txt

5.3 Synthesis Timing Result

The synthesis tool optimizes the combinational paths in a design. In general, four types of combinational paths can exist in any design:

1- Input port of the design under test to input of an internal flip-flop

2- Output of an internal flip-flop to input of another flip-flop

3- Output of an internal flip-flop to output port of the design under test

4- A combinational path connecting the input and output ports of the design under test.

The last DC command in the script developed in the previous section instructs the tool to report the path with the worst timing. In this case, the path with the worst timing is a combinational path of type two. The delay associated with this path is the sum of the delays of all combinational gates in the path plus the Clock-To-Q delay of the originating flip-flop, which was calculated as 24.09ns. Adding the setup time of the destination flip-flop in this path, which is 0.85ns, gives 24.94ns, which fits within the 25ns period of the 40MHz clock, so the clock constraint is satisfied on the worst path. The delays of combinational gates, setup times of flip-flops and Clock-To-Q values are derived from the LSI_10K library file that was used for the mapping step during synthesis. The synthesis timing report is shown below:

Report : timing
        -path full
        -delay max
        -max_paths 1
Design : AES128_cipher_top
Version: Z-2007.03
Date   : Mon Nov 16 21:25:14 2009

Operating Conditions:
Wire Load Model Mode: top

  Startpoint: u0/w3_reg[22]
              (rising edge-triggered flip-flop clocked by clk)
  Endpoint: u0/w2_reg[27]
            (rising edge-triggered flip-flop clocked by clk)
  Path Group: clk
  Path Type: max
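
The timing margin on the worst path can be checked with a short calculation. The Python sketch below uses only the figures quoted in this section (40MHz clock, 24.09ns path delay, 0.85ns setup time). The function names are hypothetical, and clock skew and uncertainty are assumed to be zero, which the report does not state.

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Convert a clock frequency in MHz to a period in nanoseconds."""
    return 1000.0 / freq_mhz


def setup_slack_ns(period_ns: float, path_delay_ns: float, setup_ns: float) -> float:
    """Setup slack of a flip-flop to flip-flop path (skew assumed to be zero)."""
    return period_ns - (path_delay_ns + setup_ns)


period = clock_period_ns(40.0)               # 40MHz clock from the constraints
slack = setup_slack_ns(period, 24.09, 0.85)  # values quoted in the text

# Prints: period = 25.00 ns, slack = 0.06 ns
print(f"period = {period:.2f} ns, slack = {slack:.2f} ns")
```

A positive slack means the path meets the clock constraint. Here the margin is only about 0.06ns, so 40MHz is close to the maximum frequency for this netlist and library.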

5.4 Synthesis Area Result

The synthesis area report shows the total number of cells and nets in the netlist. It also uses the area parameter associated with each cell in the LSI_10K library file to calculate the total combinational and sequential area of the netlist. The total area of the gate level netlist is unknown since it depends on the total area of the interconnects, which itself is a function of the wiring load model used in physical design. The total cell area in the netlist is reported as 22978 units, which is the sum of combinational and sequential areas. The synthesis area report is shown below:

5.5 Synthesis Constraint Violators Result

To force the synthesis tool to create the most compact netlist, the area of the gate level netlist was constrained to zero during the synthesis process. As a result, the only constraint violation, which is expected, is related to the area as shown below:

CHAPTER 6

AES128 SOFTWARE IMPLEMENTATION

6.1 Overview

The optimized gate level netlist generated after synthesizing the hardware model by using the LSI_10K technology library can operate at a 40MHz clock signal. Since the hardware model takes ten clock cycles (for ten rounds of encryption) to encrypt a 128-bit block, the overall delay for encrypting a block of plaintext is 250ns.
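
The 250ns figure follows directly from the clock period and the cycle count, as the Python sketch below shows. The throughput value is a derived estimate that does not appear in the report. It assumes blocks are processed back to back with no extra cycles for loading the key or data.

```python
CLOCK_MHZ = 40.0        # synthesis clock constraint
CYCLES_PER_BLOCK = 10   # one clock cycle per AES-128 round
BLOCK_BITS = 128        # AES block size

period_ns = 1000.0 / CLOCK_MHZ
latency_ns = CYCLES_PER_BLOCK * period_ns

# Assumption: back-to-back blocks, no load or unload overhead.
# bits per ns multiplied by 1000 gives Mbit/s.
throughput_mbps = BLOCK_BITS / latency_ns * 1000.0

# Prints: latency = 250 ns, throughput = 512 Mbit/s
print(f"latency = {latency_ns:.0f} ns, throughput = {throughput_mbps:.0f} Mbit/s")
```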

In order to compare the speed of the hardware implementation with that of a software implementation, the AES128 algorithm was modeled in the “C” language. The “C” program was then run on a virtual system, and the statistics of the virtual system were gathered before and after encrypting a block of plaintext. The number of CPU cycles that were required on the virtual system to encrypt a block of plaintext was used to compare the efficiency of the software and hardware implementations.

6.2 AES128 Software Implementation on a Simics Virtual System

Simics is a complete functional simulation tool for creating virtual platforms that supports “single-core, multicore, multiple processor, and multiple machine configurations (racks, clusters, and distributed systems)”.

Simics supports several processor families (e.g. ARM, MIPS, PowerPC, x86) and runs the same binary software as the physical target system. “To the target software, the virtualized target hardware behaves exactly the same as the physical target hardware.” [8]

In this project, the Simics software was used to create a virtual system based on Intel’s x86 architecture and the 440BX chipset. The target virtual system consisted of a 2GHz Pentium 4 processor and ran the Red Hat 7.3 Enterprise Linux operating system.

The “C” program implementing the AES128 encryption algorithm (see Appendix D) was ported to the Simics virtual system and then compiled to create the executable file (object file). The virtual system’s statistics were gathered during the execution of the “C” program, before and after encrypting a block of plaintext. This was done by using the Simics “Magic” instruction, which called a registered Python function for gathering the virtual system statistics. The portion of the “C” code for encrypting a block of plaintext is shown in Figure 6.1. Encrypting a block of plaintext involves copying the block to the state, generating the round keys from the cipher key and performing ten rounds of encryption on the state.

Figure 6.1 – AES128 Block Encryption Pseudo Code in “C”
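
The three steps named above (copy the block to the state, generate the round keys, perform ten rounds) can be sketched as follows. This is a Python illustration of AES-128 block encryption as specified in FIPS 197, not the authors’ “C” program from Appendix D. All function names are hypothetical, and the S-box is computed rather than stored as a table purely to keep the listing short.

```python
def xtime(a):
    """Multiply by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def gmul(a, b):
    """Multiply two bytes in GF(2^8)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r


def build_sbox():
    """S-box: multiplicative inverse followed by the affine transform."""
    def rotl(v, n):
        return ((v << n) | (v >> (8 - n))) & 0xFF
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
        sbox.append(inv ^ rotl(inv, 1) ^ rotl(inv, 2) ^ rotl(inv, 3)
                    ^ rotl(inv, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()


def expand_key(key):
    """Generate the 11 round keys (16 bytes each) from a 16-byte cipher key."""
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = [SBOX[b] for b in temp[1:] + temp[:1]]  # RotWord, SubWord
            temp[0] ^= rcon
            rcon = xtime(rcon)
        words.append([a ^ b for a, b in zip(words[i - 4], temp)])
    return [sum(words[4 * r:4 * r + 4], []) for r in range(11)]


def shift_rows(s):
    """Rotate row r of the column-major state left by r positions."""
    return [s[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


def mix_columns(s):
    """Multiply each state column by the fixed MixColumns matrix."""
    out = []
    for c in range(4):
        a = s[4 * c:4 * c + 4]
        for r in range(4):
            out.append(gmul(a[r], 2) ^ gmul(a[(r + 1) % 4], 3)
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return out


def aes128_encrypt_block(plaintext, key):
    """Encrypt one 16-byte block with a 16-byte key."""
    round_keys = expand_key(key)                     # generate round keys
    state = [p ^ k for p, k in zip(plaintext, round_keys[0])]  # copy + AddRoundKey
    for rnd in range(1, 11):                         # ten rounds
        state = shift_rows([SBOX[b] for b in state])  # SubBytes, ShiftRows
        if rnd != 10:                                # last round skips MixColumns
            state = mix_columns(state)
        state = [s ^ k for s, k in zip(state, round_keys[rnd])]
    return bytes(state)


# Example vector from FIPS 197, Appendix C.1.
key = bytes.fromhex("000102030405060708090a0b0c0d0e0f")
block = bytes.fromhex("00112233445566778899aabbccddeeff")
print(aes128_encrypt_block(block, key).hex())  # 69c4e0d86a7b0430d8cdb78070b4c55a
```

A production implementation would normally use a precomputed S-box table, as a “C” version typically does. The computed form here trades speed for brevity.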

The target virtual system statistics before and after encryption of a plaintext block are summarized in Table 6.2. The Callback1 and Callback2 statistics refer to the virtual system’s state before and after the encryption of the plaintext block, respectively.

Table 6.2 – Simics Virtual System Statistics

The “User” and “Supervisor” columns refer to the number of instructions that were executed in the user space and the system space, respectively. Since the clocks per instruction for the virtual target was assumed to be one (CPI=1), the total number of CPU cycles was equal to the total number of instructions.

The results show that encrypting a block of plaintext in software takes more than 30,000 CPU cycles of the virtual target system. Since the virtual system has a 2GHz Pentium 4 processor, the encryption of a plaintext block takes more than 15us, which is 60 times slower than the proposed hardware implementation.
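
The comparison above reduces to the following arithmetic, sketched in Python. The cycle count of 30,000 is the lower bound quoted in the text rather than the exact value from Table 6.2, so the resulting factor of 60 is also a lower bound. The variable names are hypothetical.

```python
SW_CYCLES = 30_000      # lower bound on CPU cycles per block (CPI = 1)
CPU_GHZ = 2.0           # virtual Pentium 4 clock frequency
HW_LATENCY_NS = 250.0   # hardware latency: 10 cycles at 40MHz

# cycles divided by cycles-per-microsecond gives microseconds
sw_time_us = SW_CYCLES / (CPU_GHZ * 1000.0)
slowdown = sw_time_us * 1000.0 / HW_LATENCY_NS

# Prints: software = 15.0 us, slowdown = 60x
print(f"software = {sw_time_us:.1f} us, slowdown = {slowdown:.0f}x")
```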