
8 December 2005, VLSI Design 2006

1

A High-Performance VLSI Architecture for Advanced Encryption Standard (AES) Algorithm

N. M. Kosaraju, M. Varanasi & Saraju P. Mohanty
VLSI Design and CAD Laboratory
University of North Texas, Denton, TX, USA.
Homepage: http://www.vdcl.cse.unt.edu

Email: [email protected]


2

Outline of the Talk

• Introduction
• The Rijndael Algorithm
• Related Work
• Proposed Architecture
• Prototype Implementation
• Performance Analysis
• Conclusions


3

Introduction

• Techniques like cryptography, steganography, watermarking, and scrambling have been developed to keep data secure, private, and copyright protected.

• The need for secure transactions in e-commerce, private networks, and secure messaging has moved encryption into the commercial realm.

• In October 2000, Rijndael, developed by Joan Daemen and Vincent Rijmen, was announced as the new encryption standard, replacing the Data Encryption Standard (DES).


4

Rijndael Algorithm

• The Rijndael algorithm is an iterative, symmetric block cipher with variable block length and variable key length.

• The number of rounds in the algorithm depends on the block length and the key length.

• For AES, the block length is fixed at 128 bits (by NIST) and the key length can be 128, 192 or 256 bits.

• The data block (B) and the key (K) are split into arrays of bytes (called the State) and represented as matrices arranged in column-major order.
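
The column-major arrangement can be sketched as follows. This is only an illustration: the helper names `bytes_to_state` and `state_to_bytes` are made up for this example and do not come from the slides.

```python
def bytes_to_state(block):
    """Arrange a 16-byte block as a 4x4 State, filled column by column."""
    if len(block) != 16:
        raise ValueError("a 128-bit block has exactly 16 bytes")
    # state[r][c] holds input byte r + 4*c (column-major order).
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]


def state_to_bytes(state):
    """Read the State back out column by column."""
    return bytes(state[r][c] for c in range(4) for r in range(4))


state = bytes_to_state(bytes(range(16)))
for row in state:
    print(row)
```

Row 0 of the State therefore holds input bytes 0, 4, 8 and 12, which is why Shift Row later operates across bytes that are four positions apart in the input block.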


5

Principle of Rijndael Algorithm

3 distinct phases (Plain Text in, Cipher Text out):

1. Initial Round: an initial data/key addition (Round Key Addition with Round Key [0]).
2. Basic Round, repeated Nr-1 times: ByteSub, ShiftRow, MixColumn, Round Key Addition with Round Key [round]. This gives nine (128-bit key), eleven (192-bit key) or thirteen (256-bit key) standard rounds.
3. Final Round: ByteSub, ShiftRow, Round Key Addition with Round Key [Nr].

Each round has a new round key; the expanded key has a length of Nb(Nr+1) 32-bit words.
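
A minimal sketch of how the round count and the round sequence follow from the key length. It assumes the standard relation Nr = max(Nk, Nb) + 6 and the standard expanded key size of Nb(Nr+1) words (one round key per key addition); the function names are illustrative.

```python
NB = 4  # block length in 32-bit words (128-bit block)


def num_rounds(key_bits):
    """Nr for a 128-bit block: 10, 12 or 14."""
    if key_bits not in (128, 192, 256):
        raise ValueError("key length must be 128, 192 or 256 bits")
    nk = key_bits // 32  # key length in 32-bit words
    return max(nk, NB) + 6


def round_plan(key_bits):
    """List the transformations of every phase, in order."""
    nr = num_rounds(key_bits)
    basic = ["ByteSub", "ShiftRow", "MixColumn", "Round Key Addition"]
    plan = [("initial", ["Round Key Addition"])]
    plan += [("round %d" % i, list(basic)) for i in range(1, nr)]
    plan.append(("final", ["ByteSub", "ShiftRow", "Round Key Addition"]))
    return plan


def expanded_key_words(key_bits):
    """One 128-bit round key (Nb words) per key addition: Nb * (Nr + 1)."""
    return NB * (num_rounds(key_bits) + 1)


for bits in (128, 192, 256):
    print(bits, num_rounds(bits), expanded_key_words(bits))
```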


6

Order of Operations in Encryption and Decryption

Legend: INV Multiplicative Inverse, LT Linear Transformation, SR Shift Row, MC Mix Column, KA Key Addition. Nr = 10 for 128-bit input and key length.

Encryption (Plain Text to Cipher Text):
• Initial Round: KA (Initial Key)
• Normal Round (Nr - 1 times): INV, LT, SR, MC, KA (Round Key)
• Final Round: INV, LT, SR, KA (Final Key)

Decryption (Cipher Text to Plain Text):
• Initial Round: KA (Final Key)
• Normal Round (Nr - 1 times): SR, LT, INV, KA, MC (Round Key)
• Final Round: SR, LT, INV, KA (Initial Key)

7

Related Work (Architectures for Rijndael)

• Kuo and Verbauwhede [2001]:
  – Encryption module to generate intermediate data.
  – A key scheduling module to generate the round keys.
  – Data encryption done at a rate of 1.82 Gbps.

• McLoone and McCanny [2001]:
  – High-performance single-chip FPGA implementation.
  – Supports different key sizes.
  – 192-bit key design runs at 5.8 Gbps and 256-bit key design runs at 5.2 Gbps.

• Mangard et al. [2003]:
  – Combinational paths are relatively short and balanced.
  – S-boxes have a pipelined implementation using combinational logic.
  – The high-performance version achieved 241 Mbps.

• Sodon et al. [2005]:
  – A low-cost architecture using a bit-serial approach.
  – FPGA-based prototype has a maximum clock frequency of 510 MHz with a throughput of 0.37 Gbps.


8

Salient Features of our Architecture

• A high performance, high throughput and area efficient VLSI architecture.

• Architecture is optimized for high throughput in terms of the encryption and decryption data rates using pipelining.

• Polynomial multiplication is implemented using XOR operations instead of multipliers to decrease the hardware complexity.

• Both encryption and decryption modes use common hardware resources, thus making the architecture and the corresponding implementation area efficient.

• Selective use of look-up tables and combinational logic further enhances the architecture's memory optimization, area, and performance.

• An effective solution for online (real-time) round key generation that needs significantly less storage for buffering.


9

Architecture : Data and Control Flow

• Architecture consists of a data unit and a key unit.

• Data Unit: (i) byte substitution, (ii) shift row, (iii) mix column, and (iv) round key addition.

• Key Unit: key scheduling and expansion.

Figure: Data and Control Flow. Blocks shown: Input, Initial Key, Encryption/Decryption select, ByteSub Transformation, ShiftRow Transformation, MixColumn Transformation, AddRoundKey Transformation, Final Round/Initial Round select, Output.


10

Architecture : Modes of Operation

• A block cipher algorithm can be operated in different modes:
  – Electronic Code Book Mode (ECB)
  – Cipher Block Chaining Mode (CBC)
  – Cipher Feed-Back Mode (CFB)
  – Output Feed-Back Mode (OFB)

• For modes with feedback operations, a pipelined design has no additional advantage, since the encryption of each block depends on the previous result.
  – ECB mode of operation is chosen for our implementation.
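
The data dependency behind this choice can be sketched with a toy example. `toy_encrypt_block` is a made-up stand-in for one block encryption (it is not AES and not secure); only the structure of the two loops matters.

```python
MASK = 0xFFFFFFFF


def toy_encrypt_block(block, key):
    """Hypothetical stand-in for one block encryption. NOT secure."""
    return ((block ^ key) * 2654435761) & MASK


def ecb(blocks, key):
    # Every block is independent, so several blocks can occupy
    # different pipeline stages at the same time.
    return [toy_encrypt_block(b, key) for b in blocks]


def cbc(blocks, key, iv):
    out, prev = [], iv
    for b in blocks:
        # Block i cannot start before ciphertext i-1 is complete,
        # so the pipeline would hold one block at a time.
        prev = toy_encrypt_block(b ^ prev, key)
        out.append(prev)
    return out


ecb_out = ecb([7, 7, 7], key=0x1234)
cbc_out = cbc([7, 7, 7], key=0x1234, iv=0x55)
print(ecb_out, cbc_out)
```

In ECB, equal plaintext blocks produce equal ciphertext blocks, which is the price paid for the independence that pipelining exploits.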


11

Architecture: Pipelining and Looping

• The Rijndael algorithm is implemented in hardware using two basic concepts:
  – Pipelining: replicating the same rounds and placing registers in between.
    • Advantage: increases the throughput.
  – Iterative Looping: one round of hardware design, which forces the algorithm to reuse the same hardware.
    • Advantage: reduces the amount of area.


12

Operations Needed in Architecture

1. Byte Substitution Transformation
2. Shift Row Transformation
3. Mix Column Transformation
4. Key Addition Transformation


13

Architecture : Byte substitution

• The Byte Substitution transformation is a nonlinear byte-wise substitution applied to each byte individually. It consists of two phases:
  – Multiplicative inverse of a state byte in GF(2^8)
  – An affine/inverse affine mapping transformation over GF(2) for encryption/decryption

Encryption: multiplicative inverse over GF(2^8), followed by the affine mapping.

Decryption: inverse affine mapping, followed by the multiplicative inverse over GF(2^8).
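
The two phases can be sketched in software as follows. This is a bit-level illustration using the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 and the standard affine constant 0x63; the architecture itself stores the multiplicative inverse in ROM, so this is not a model of the hardware. Function names are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse as a^254; 0 is mapped to 0 by convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF


def affine(b):
    """Affine mapping over GF(2) used for encryption."""
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


def inv_affine(s):
    """Inverse affine mapping over GF(2) used for decryption."""
    return rotl8(s, 1) ^ rotl8(s, 3) ^ rotl8(s, 6) ^ 0x05


def byte_sub(x):
    return affine(gf_inv(x))


def inv_byte_sub(y):
    return gf_inv(inv_affine(y))


print(hex(byte_sub(0x53)))
```

Both directions share `gf_inv`, which mirrors the resource sharing in the architecture: only the affine stage differs between encryption and decryption.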


14

Architecture : Shift Row

• The rows of the state matrix are cyclically shifted to the left during encryption and to the right during decryption, by a fixed offset for each row.
  – For a data block of length 128 bits, the offsets for each row are as follows:
    • Row 0 is shifted by 0 bytes
    • Row 1 is shifted by 1 byte
    • Row 2 is shifted by 2 bytes
    • Row 3 is shifted by 3 bytes
  – The Shift Row transformation is implemented using combinational logic instead of look-up tables, which allows for area minimization.
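
Because the offsets are fixed, Shift Row is a pure byte permutation, which is why it reduces to wiring in hardware. A minimal sketch, assuming the State is stored as a flat list of 16 bytes in column-major order (byte r + 4c is row r, column c):

```python
def shift_rows(state, decrypt=False):
    """Rotate row r left by r bytes (right by r bytes when decrypting)."""
    sign = -1 if decrypt else 1
    # Output position r + 4*c takes the byte from column (c + r) mod 4
    # of the same row (or column (c - r) mod 4 for decryption).
    return [state[r + 4 * ((c + sign * r) % 4)]
            for c in range(4) for r in range(4)]


shifted = shift_rows(list(range(16)))
print(shifted)
```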


15

Architecture : Mix Column

• The Mix Column transformation is applied to the columns of the state matrix, each column being considered as a polynomial over GF(2^8).
  – During encryption, each column is multiplied modulo x^4 + 1 by the fixed polynomial c(x) = 03·x^3 + 01·x^2 + 01·x + 02.
  – During decryption, each column is multiplied by the inverse fixed polynomial d(x) = 0B·x^3 + 0D·x^2 + 09·x + 0E.

• The multiplication by fixed polynomials over GF(2^8) is implemented using XOR operations instead of multipliers.

• The inverse mix column transformation is more complex than the mix column transformation, as the coefficients of the decryption polynomial (0B, 0D, 09, 0E) are larger than those of the encryption polynomial (03, 01, 01, 02).
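
A small sketch of why no multiplier is needed: multiplying a byte by a constant in GF(2^8) decomposes into repeated multiplication by x (a shift and a conditional XOR with 0x1B) plus XORs. The names `xtime` and `gf_mul_const` are illustrative.

```python
def xtime(b):
    """Multiply a byte by x (i.e. by 02) in GF(2^8)."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B  # reduce modulo x^8 + x^4 + x^3 + x + 1
    return b


def gf_mul_const(b, const):
    """Multiply byte b by a constant using only xtime and XOR."""
    result = 0
    while const:
        if const & 1:
            result ^= b
        b = xtime(b)
        const >>= 1
    return result


# 57 * 13 = 57 * (01 + 02 + 10) = 57 ^ AE ^ 07
print(hex(gf_mul_const(0x57, 0x13)))
```

The decryption coefficients need up to three chained `xtime` steps (for example 0E = 08 ^ 04 ^ 02), while the encryption coefficients need at most one, which is consistent with the longer path noted for the inverse transformation.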


16

Architecture : Mix Column

Figure: Mix Column datapath, in two panels.

• Computation of Y or Z: inputs IN0, IN1, IN2, IN3 (8 bits each) and an Encryption / Decryption select; XOR and MULT() blocks produce the intermediate values T0 and T1 and the output Y or Z.

• Computation of Output: OUT[i mod 4] is formed from IN[i mod 4], IN[(i+1) mod 4] and Y or Z using XOR and MULT() blocks.

NOTE: MULT() is the multiplication of the byte by x over GF(2^8).
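
One way to realize a shared Mix Column / Inverse Mix Column datapath with only XOR and MULT() is sketched below. It reuses the names Y, Z, T0 and T1 from the figure, but the exact decomposition is an assumption based on the identities 0E = 02 ^ 0C, 0B = 03 ^ 08, 0D = 01 ^ 0C and 09 = 01 ^ 08; the circuit in the figure may differ in detail.

```python
def mult(b):
    """MULT(): multiply a byte by x over GF(2^8)."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b


def mix_column(col, decrypt=False):
    """Transform one 4-byte column; decrypt=True gives the inverse."""
    y = col[0] ^ col[1] ^ col[2] ^ col[3]
    if decrypt:
        y8 = mult(mult(mult(y)))           # 08 * Y
        t0 = mult(mult(col[0] ^ col[2]))   # 04 * (IN0 ^ IN2)
        t1 = mult(mult(col[1] ^ col[3]))   # 04 * (IN1 ^ IN3)
        z = [y ^ y8 ^ t0, y ^ y8 ^ t1]     # Z for even / odd output bytes
    else:
        z = [y, y]                         # encryption uses Y directly
    # Common output stage for both directions.
    return [col[i] ^ z[i % 2] ^ mult(col[i] ^ col[(i + 1) % 4])
            for i in range(4)]


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
print([hex(b) for b in mixed])
```

The output stage is identical for encryption and decryption; only the extra MULT() chain that turns Y into Z is specific to decryption.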


17

Architecture : Round Key Addition

• The state bytes and the appropriate round key generated by the key scheduling module are XORed.

Figure: sixteen byte-wise XOR blocks, Byte-1 to Byte-16. For each state byte, Bxy(i+1) = Bxy(i) XOR Kxy(i), from B00, B01, B02 through B31, B32, B33.
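
In software the same step is a byte-wise XOR of two 16-byte values. The sketch below uses the example plaintext and key from FIPS-197 Appendix B as inputs; the function name is illustrative.

```python
def add_round_key(state, round_key):
    """XOR each of the 16 state bytes with the matching round key byte."""
    return bytes(s ^ k for s, k in zip(state, round_key))


plaintext = bytes.fromhex("3243f6a8885a308d313198a2e0370734")
key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
after_initial_round = add_round_key(plaintext, key)
print(after_initial_round.hex())
```

Since XOR is its own inverse, the same block serves encryption and decryption unchanged.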


18

Architecture : Different Rounds

• Standard round architecture has all four transformations:
  1. Byte Substitution
  2. Shift Row
  3. Mix Column
  4. Round Key Addition

• Initial Round has (4)
• Final Round has (1), (2), (4)

Figure: round datapath, 128 bits wide. Blocks shown: ByteSub (multiplicative inverse in 8 * 256 ROM cells, Affine Mapping, Inverse Affine Mapping, MUX block), ShiftRow (MUX), MixColumn (MUX block, XOR block), RoundKeyAddition (XOR block with the Round Key).


19

Architecture : Key Generation

• Key Generation has two parts:
  – Key Expansion
    • The initial key is represented as a linear array W, where K0 = (W0, W1, W2, W3).
    • The initial key is expanded into a linear array of 32-bit words of length Nb * (Nr + 1).
  – Key Scheduling
    • A round key of length 128 bits, generated in every clock cycle, is given as input to the data unit of the encryption/decryption module.
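
For reference, the standard key expansion for a 128-bit key can be sketched as below, producing the full array of Nb * (Nr + 1) = 44 words at once. This is the textbook procedure rather than the register-based key unit of the architecture; the S-box is derived at start-up so that the block is self-contained, and all names are illustrative.

```python
def xtime(b):
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        a, b = xtime(a), b >> 1
    return result


def make_sbox():
    rotl = lambda x, n: ((x << n) | (x >> (8 - n))) & 0xFF
    box = []
    for x in range(256):
        inv = 0 if x == 0 else next(c for c in range(1, 256) if gf_mul(x, c) == 1)
        box.append(inv ^ rotl(inv, 1) ^ rotl(inv, 2) ^ rotl(inv, 3) ^ rotl(inv, 4) ^ 0x63)
    return box


SBOX = make_sbox()


def sub_word(w):
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")


def rot_word(w):
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def expand_key_128(key):
    """Expand a 16-byte key into 44 32-bit words (11 round keys)."""
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = w[i - 1]
        if i % 4 == 0:
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = xtime(rcon)
        w.append(w[i - 4] ^ temp)
    return w


words = expand_key_128(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(["%08x" % x for x in words[4:8]])
```

Round key i is the group of words 4i to 4i+3, so K0 = (W0, W1, W2, W3) is the initial key itself.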


20

Architecture : Key Generation

Figure: key generation unit. Four 32-bit registers (Register 0 to Register 3) are loaded through MUXes from Initial key 0 to Initial key 3 (128-bit initial key, Initial / normal select). The update path contains an S-box, a Left shift and a Round constant, with enc/dec selects for Encryption and Decryption. The outputs Round key 0 to Round key 3 form the 128-bit round key.


21

Architectural Analysis

• The forward and the reverse key scheduling are implemented on the same device, thus allowing for area minimization.

• The generation of the round key for each round takes 1 clock cycle.

• Decryption requires more cycles than encryption because it needs pre-scheduling to generate the last key value, and the Inverse Mix Column transformation has a longer critical path than the Mix Column transformation.

• Round keys are generated during the process when required, thus reducing the amount of storage for the buffer.

• Some of the modules need to be duplicated to get all the required operations done in one clock cycle for one round.
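
On-the-fly key generation in both directions can be sketched as a pair of step functions that keep only the current 128-bit round key (four words). The reverse step shows why decryption needs pre-scheduling: it has to start from the last round key. This is a software illustration under standard AES-128 key schedule assumptions, not a description of the register-level design; all names are illustrative.

```python
def xtime(b):
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        a, b = xtime(a), b >> 1
    return result


def make_sbox():
    rotl = lambda x, n: ((x << n) | (x >> (8 - n))) & 0xFF
    box = []
    for x in range(256):
        inv = 0 if x == 0 else next(c for c in range(1, 256) if gf_mul(x, c) == 1)
        box.append(inv ^ rotl(inv, 1) ^ rotl(inv, 2) ^ rotl(inv, 3) ^ rotl(inv, 4) ^ 0x63)
    return box


SBOX = make_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def g(word, rcon):
    """Left shift (byte rotation), S-box and round constant."""
    rotated = ((word << 8) | (word >> 24)) & 0xFFFFFFFF
    subbed = int.from_bytes(bytes(SBOX[b] for b in rotated.to_bytes(4, "big")), "big")
    return subbed ^ (rcon << 24)


def next_round_key(rk, rcon):
    """Forward step (encryption): round key i from round key i-1."""
    w0 = rk[0] ^ g(rk[3], rcon)
    w1 = rk[1] ^ w0
    w2 = rk[2] ^ w1
    w3 = rk[3] ^ w2
    return [w0, w1, w2, w3]


def prev_round_key(rk, rcon):
    """Reverse step (decryption): round key i-1 from round key i."""
    w3 = rk[3] ^ rk[2]
    w2 = rk[2] ^ rk[1]
    w1 = rk[1] ^ rk[0]
    w0 = rk[0] ^ g(w3, rcon)
    return [w0, w1, w2, w3]


initial = [0x2B7E1516, 0x28AED2A6, 0xABF71588, 0x09CF4F3C]
rk = initial
for rc in RCON:                 # pre-scheduling: run forward to the last key
    rk = next_round_key(rk, rc)
last = rk
for rc in reversed(RCON):       # decryption walks the keys backwards
    rk = prev_round_key(rk, rc)
print(["%08x" % w for w in last])
```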


22

Resource Sharing between Encryption and Decryption

Figure: shared datapath between Plain Text and Cipher Text. Blocks shown: S-boxes, Affine Mapping, Inverse Affine Mapping, Shift Row, Inverse Shift Row, Mix Column, Inverse Mix Column, Key Addition (two instances), Key Scheduling.


23

Prototype Implementation : Layouts

Layouts shown: Multiplicative Inverse; Affine and Inverse Affine Mapping; Key Alignment.

• The proposed architecture is custom designed using the Cadence Virtuoso layout tool in 0.35 µm CMOS technology.

• The simulations are run with HSPICE.


24

Prototype Implementation : Summary

Module / Component        Our Architecture    Mangard et al. [7]

Data Unit
  S-Boxes                 16                  16
  32-bit Registers        8                   16
  Multiplexers            240                 384
  32-bit Multiplexers     180                 NA
  Multipliers             0                   16

Key Unit
  S-Boxes                 4                   NA
  32-bit Registers        4                   NA
  32-bit Multiplexers     4                   NA
  128-bit Multiplexers    60                  NA


25

Prototype Implementation : Performance

Architecture                             Clock Cycles    Throughput (Mbps)
Proposed Architecture                    11              232
Mangard et al. [7] - Standard            64              128
Mangard et al. [7] - High Performance    34              241

• Throughput = (Block Length * Clock Frequency) / (Cycles per Block).

• The pipelined version of our architecture has a throughput of 1.83 Gbps.
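
The throughput formula can be checked against the table with a few lines. The clock frequencies are not stated on the slides; the values below are back-calculated from the reported throughput and cycle counts and should be read as implied figures, not measured ones.

```python
BLOCK_BITS = 128


def throughput_bps(block_bits, clock_hz, cycles_per_block):
    """Throughput = (block length * clock frequency) / cycles per block."""
    return block_bits * clock_hz / cycles_per_block


def implied_clock_hz(block_bits, throughput, cycles_per_block):
    """Rearranged formula: the clock needed to reach a given throughput."""
    return throughput * cycles_per_block / block_bits


# (throughput in bits per second, cycles per block) taken from the table.
table = {
    "Proposed Architecture": (232e6, 11),
    "Mangard et al. [7] - Standard": (128e6, 64),
    "Mangard et al. [7] - High Performance": (241e6, 34),
}
for name, (tput, cycles) in table.items():
    mhz = implied_clock_hz(BLOCK_BITS, tput, cycles) / 1e6
    print("%s: implied clock of about %.1f MHz" % (name, mhz))
```

Under this reading, the proposed design reaches a throughput comparable to the high-performance reference at a much lower implied clock, because it needs only 11 cycles per block.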


26

Conclusions

• A VLSI architecture for the Rijndael (AES) algorithm is presented.
• The key length and the data block length are fixed at 128 bits.
• Feedback and pipelined architectures were used for the implementation.
• The algorithm was implemented in the ECB mode of operation.
• The pipelined architecture could process data at 1.83 Gbits/sec.


27

Thank You!
