ADVANCED BIT MANIPULATION INSTRUCTIONS: ARCHITECTURE, IMPLEMENTATION AND APPLICATIONS

Yedidya Hilewitz

A dissertation presented to the faculty of Princeton University in candidacy for the degree of Doctor of Philosophy. Recommended for acceptance by the Department of Electrical Engineering. Adviser: Ruby B. Lee. September 2008.
56]. It conserves some of the most useful properties of grp, while being easier to implement: the bits that would have been gathered to the left are instead zeroed out.
3.1.2 Bit Scatter or Parallel Deposit (pdep)
Bit scatter takes the right-aligned, contiguous bits in a register and scatters them in the
result register according to a mask in a second input register. This is the reverse operation
to bit gather. We also call bit scatter a parallel deposit instruction, or pdep, because it
is like a parallel version of the deposit (dep) instruction. Figure 3.2 compares dep and
pdep. The deposit (dep) instruction takes a right justified field of bits from the source
register and deposits it at any single position in the destination register. The parallel deposit
(pdep)instruction takes a right justified field of bits from the source register and deposits
the bits in different non-contiguous positions indicated by a bit mask.
Figure 3.2: (a) dep r1 = r2, pos, len (b) pdep r1 = r2, r3
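As an illustrative sketch of these semantics (a Python behavioral model with hypothetical function names, not part of the hardware design), registers can be treated as bit strings, mirroring the lettered examples of Figure 3.2:

```python
def dep(rd: str, r2: str, pos: int, length: int) -> str:
    """Deposit: place the `length` rightmost bits of r2 into rd as one
    contiguous field whose rightmost bit lands at bit position `pos`
    (bit positions counted from the right)."""
    n = len(rd)
    field = r2[n - length:]
    left = n - pos - length            # characters to the left of the field
    return rd[:left] + field + rd[left + length:]

def pdep(r2: str, mask: str) -> str:
    """Parallel deposit: scatter the rightmost k bits of r2 (k = number of
    '1's in mask) to the mask's '1' positions, in order; other positions 0."""
    k = mask.count("1")
    selected = iter(r2[len(r2) - k:])  # the k rightmost data bits, left to right
    return "".join(next(selected) if m == "1" else "0" for m in mask)

def pex(r2: str, mask: str) -> str:
    """Parallel extract (bit gather): gather the bits selected by the mask
    '1's and right-justify them, zero-filling on the left."""
    picked = "".join(d for d, m in zip(r2, mask) if m == "1")
    return picked.rjust(len(r2), "0")
```

For the running example, pdep("abcdefgh", "10101101") yields "d0e0fg0h", and pex is its inverse on the deposited bits.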
Figure 3.3: Labeling of butterfly network.
3.2 Datapaths for Parallel Extract and Parallel Deposit
It is not intuitively clear that pex and pdep are easy to implement, especially in a single
cycle. In this section, we show that parallel deposit can be mapped to the butterfly network
and that parallel extract can be mapped to the inverse butterfly network. This provides the
basis for a single functional unit that performs bit gather, bit scatter and bit permutation
operations using both butterfly and inverse butterfly datapaths.
3.2.1 Parallel Deposit on the Butterfly Network
We first show an example parallel deposit operation on the butterfly network. Then, we
show that any parallel deposit operation can be performed using a butterfly network.
Figure 3.3 shows our labeling of the left (L) and right (R) halves of successive stages
of a butterfly network. Since we often appeal to induction for successive stages, we usually
omit the subscripts for these left and right halves.
Figure 3.4(a) shows an example pdep operation with mask = 10101101. Figure 3.4(b)
Figure 3.4: (a) 8-bit pdep operation (b) mapped onto butterfly network with explicit right rotations of data bits between stages and (c) without explicit rotations of data bits by modifying the control bits.
shows this pdep operation broken down into steps on the butterfly network. In the first
stage, we transfer from the right (R) to the left half (L) the bits whose destination is in L,
namely bits d and e. Prior to stage 2, we right rotate e00d by 3, the number of bits that
stayed in R, to right justify the bits in their original order, 00de, in the L half. Note that
bits that stay in the right half, R, are already right-justified.
In each half of stage 2 we transfer from the local R to the local L the bits whose final
destination is in the local L. So in R, we transfer g to RL and in L we transfer d to LL. Prior
to stage 3, we right rotate the bits to right justify them in their original order in their new
subnetworks. So d0 is right rotated by 1, the number of bits that stayed in LR, to yield 0d
and gf is right rotated by 1, the number of bits that stayed in RR, to yield fg.
In each subnetwork of stage 3 we again transfer from the local R to the local L the bits
whose final destination is in the local L. So in LL we transfer d and in LR we transfer e.
After stage 3 we have transferred each bit to its correct final destination: d0e0fg0h. Note
that we use a control bit of “0” to indicate a pass through operation and a control bit of “1”
to indicate a swap.
Rather than explicitly right rotating the data bits in the L half after each stage, we can
compensate by modifying the control bits. This is shown in Figure 3.4(c). How the control
bits are derived will be explained later in Section 3.2.4.
We now show that any pdep operation can be mapped to the butterfly network.
Fact 3.1. Any single data bit can be moved to any result position by just moving it
to the correct half of the intermediate result at every stage of the butterfly network.
This can be proved by induction on the number of stages. At stage 1, the data bit
is moved within n/2 positions of its final position. At stage 2, it is moved within n/4
positions of its final result, and so on. At stage lg(n), it is moved within n/2^lg(n) = 1
position of its final result, which is its final result position. Referring back to Figure 3.4(b),
we utilized Fact 3.1 to decide which bits to keep in R and which to transfer from R to L at
each stage.
Fact 3.2. If the mask has k “1”s in it, the k rightmost data bits are selected and
moved, i.e., the selected data bits are contiguous. They never cross each other in
the final result.
This fact is by definition of the pdep instruction. See the example of Figure 3.4(a)
where there are 5 “1”s in the mask and the selected data bits are the 5 rightmost bits,
defgh; these bits are spread out to the left maintaining their original order, and thus never
crossing each other in the final result.
Fact 3.3. If a data bit in the right half (R) is swapped with its paired bit in the left
half (L), then all selected data bits to the left of it will also be swapped to L (if they
are in R) or stay in L (if they are in L).
Since the selected data bits never cross each other in the final result (Fact 3.2), once a
bit swaps to L, the selected bits to the left of it must also go to L. Hence, if there is one
“1” in the mask, the one selected data bit, d0, can go to R or L. If there are two “1”s in the
mask, the two selected data bits, d1d0, can go to RR or LR or LL. That is, if the data bit on
the right stays in R, then the next data bit can go to R or L, but if the data bit on the right
goes to L, the next data bit must also go to L. If there are three “1”s, the three selected data
bits, d2d1d0, can go to RRR, LRR, LLR or LLL. For example, in Figure 3.4(b) stage 1, the
five bits have the pattern LLRRR as e is transferred to L and d must then stay in L.
Fact 3.4. The selected data bits that have been swapped from R to L, or stayed in
L, are all contiguous mod n/2 in L.
From Fact 3.3, the destinations of the k selected data bits dk−1 . . .d0 must be of the form
L. . .LR. . .R, a string of zero or more L’s followed by zero or more R’s (see Figure 3.5).
Define X as the bits staying in R, Y as the bits going to L that start in R and Z as the bits
going to L that start in L. It is possible that:
i. X alone exists – when there are no selected data bits that go to L,
ii. Y alone exists – when there are no selected data bits that start in L and all bits that start
in R go to L,
iii. X and Y exist – when there are no selected data bits that start in L and some bits that
start in R stay in R and some go to L,
iv. X and Z exist – when all the bits in R are going to R, and all bits going to L start in L,
or
v. X , Y and Z exist.
When X alone exists (i), there are no bits that go to L, so Fact 3.4 is irrelevant.
The structure of the butterfly network requires that when bits are moved in a stage, they
all move by the same amount. Fact 3.2 states that the selected data bits are contiguous.
Together these imply that when Y alone exists or X and Y exist (ii and iii), Y is moved as
a contiguous block from R to L and Fact 3.4 is trivially true.
When X and Z exist (iv), Z is a contiguous block of bits that does not move so again
Fact 3.4 is trivially true.
When X , Y and Z exist (v), Y comprises the leftmost bits of R, and Z the rightmost
bits in L since they are contiguous across the midpoint of the stage (Fact 3.2). When Y is
swapped to L, since the butterfly network moves the bits by an amount equal to the size of L
or R in a given stage, Y becomes the leftmost bits of L. Thus Y and Z are now contiguous
mod n/2, i.e., wrapped around, in L (Figure 3.5).
Thus Fact 3.4 is true in all cases.
For example, in Figure 3.4(b) at the input to stage 1, X is bits fgh, Y is bit e and Z
is bit d. Y is the leftmost bit in R and Z is the rightmost bit in L. After stage 1, Y is the
leftmost bit in L and is contiguous with Z mod 4, within L, i.e., de is contiguous mod 4 in
e00d.
Figure 3.5: At the output of a butterfly stage, Y and Z are contiguous mod n/2 in L and can be rotated to be the rightmost bits of L.
Fact 3.5. The selected data bits in L can be rotated so that they are the rightmost
bits of L, and in their original order.
From Fact 3.4, the selected data bits are contiguous mod n/2 in L. At the output of
stage 1 in Figure 3.5, these bits are offset to the left by the size of X (the number of bits
that stayed in R), denoted by |X|. Thus if we explicitly rotate right the bits by |X|, the
selected data bits in L are now the rightmost bits of L in their original order (Figure 3.5).
In Figure 3.4(b), Fact 3.5 was utilized prior to stages 2 and 3.
At the end of this step, we have two half-sized butterfly networks, L and R, with the
selected data bits right-aligned and in order in each of L and R (last row of Figure 3.5). The
above can now be repeated recursively for the half-sized butterfly networks, L and R, until
each L and R is a single bit. This is achieved after lg(n) stages of the butterfly network (see
Figure 3.4(b)).
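The recursive construction of Facts 3.1 through 3.5 can be sketched as a Python behavioral model (our own illustration; `route` is a hypothetical helper name). It performs the stage swaps and the explicit right rotations of the left half, then recurses on the half-sized subnetworks:

```python
def route(bits: list, mask: str) -> list:
    """Route a pdep on a butterfly network with explicit right rotations of
    the left half between stages (Facts 3.1-3.5). `bits` holds the selected
    data bits right-aligned, with '0' in all other positions."""
    n = len(mask)
    if n == 1:
        return bits
    h = n // 2
    keep = mask[h:].count("1")       # |X|: selected bits that stay in R
    out = bits[:]
    # Stage: the (h - keep) leftmost pairs swap ('1'), the keep rightmost pass.
    for p in range(h - keep):
        out[p], out[h + p] = out[h + p], out[p]
    # Fact 3.5: right rotate L by |X| so its selected bits are right-justified.
    left = out[:h]
    k = keep % h
    left = left[h - k:] + left[:h - k]
    # Recurse on the half-sized L and R subnetworks with the mask halves.
    return route(left, mask[:h]) + route(out[h:], mask[h:])
```

Tracing the mask 10101101 of Figure 3.4(b) reproduces the intermediate values e00d, 00de and the final result d0e0fg0h described in the text.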
The selected data bits emerge from stage 1 in Figure 3.5 rotated to the left by |X|. In
Fact 3.5, the selected data bits are explicitly rotated back to the right by |X|. Instead, we
can compensate for the rotation by modifying the control bits of the subsequent stages to
limit the rotation to be within each subnetwork. For example, if the n-bit input to stage 1 is
rotated by k positions, the two n/2-bit inputs to the L and R subnetworks are rotated by k
(mod n/2) within each subnetwork. At the output of stage lg(n), the subnetworks are 1-bit
wide so the rotations are absorbed.
Fact 3.6. If the data bits are rotated by k positions left (or right) at the input to a
stage of a butterfly network, then at the output of that stage we can obtain a rotation
left (or right) by k positions of each half of the output bits by rotating left (or right)
the control bits by k positions and complementing them upon wrap around.
Consider again the example of Figure 3.4(b). The selected data bits emerge from stage
1 left rotated by 3 bits, i.e., L is e00d, left rotated by 3 positions from 00de. In Fig-
ure 3.4(b), we explicitly rotated the data bits back to the right by 3. Instead, we can
compensate for this left rotation by left rotating and complementing upon wrap around
by 3 positions the control bits of the subsequent stages. This is shown in Figure 3.6.
For stage 2, the control bit pattern of L, after left rotate and complement by 3, becomes
10 → 00 → 01 → 11. The rotation by 3 is limited to a rotation by 3 mod 2 = 1 within
each half of the output of L of stage 2 as the output is transformed from d0,0e in Fig-
ure 3.4(b) to 0d,e0 in Figure 3.6. For stage 3, the rotation and complement by 3 of the two
single control bits in L become three successive complements: 1 → 0 → 1 → 0, and the
left rotation of L is absorbed as the overall output is still d0e0. Hence, Figure 3.6 shows
how the control bits in stages 2 and 3 compensate for the left rotate by 3 bits at the output
of stage 1 (cf. Figure 3.4(b).)
Figure 3.4(c) shows the control bits after also compensating for the left rotate by 1 bit
of RL and LL, prior to stage 3 in Figure 3.6. The explicit right rotation prior to stage 3 is
eliminated. Instead, the two control bits in RL and LL transform from 0 → 1 to absorb the
rotation. Hence, the overall result in Figure 3.4(c) remains the same as in Figure 3.4(b).
The explicit data rotations in Figure 3.4(b) are replaced with rotations of the control bits
instead, complementing them on wraparound.
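Fact 3.6 can also be checked mechanically. The sketch below (our own verification code, with hypothetical helper names) models one butterfly stage of width n with pair distance n/2, and confirms that rotating the input and LROTC-ing the control bits yields each half of the output rotated:

```python
def lrotc(bits: str, rot: int) -> str:
    """Left rotate and complement on wrap: each wrapped bit is flipped."""
    for _ in range(rot):
        bits = bits[1:] + ("0" if bits[0] == "1" else "1")
    return bits

def stage(data: str, ctrl: str) -> str:
    """One n-bit butterfly stage: control bit p swaps data[p] and data[p+n/2]."""
    h = len(data) // 2
    out = list(data)
    for p, c in enumerate(ctrl):
        if c == "1":
            out[p], out[h + p] = out[h + p], out[p]
    return "".join(out)

def rotl(s: str, k: int) -> str:
    k %= len(s)
    return s[k:] + s[:k]

def halves_rotl(s: str, k: int) -> str:
    """Rotate each half of s left by k, the outcome Fact 3.6 promises."""
    h = len(s) // 2
    return rotl(s[:h], k) + rotl(s[h:], k)
```

Fact 3.6 then states that stage(rotl(x, k), lrotc(c, k)) equals halves_rotl(stage(x, c), k), which can be verified exhaustively for an 8-bit stage.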
We now explain why the control bits are complemented when they wrap around. The
Figure 3.6: The elimination of explicit right rotations after stage 1 in Figure 3.4(b).
goal is to keep the data bits in the half they were originally routed to at each stage of the
butterfly network, in spite of the rotation of the input.
Figure 3.7(a) shows a pair of bits, a and b, that were originally passed through. So
we wish to route a to L and b to R in spite of any rotation. As the bits are rotated (Fig-
ure 3.7(b)), the control bit is rotated with them, keeping a in L and b in R, as desired.
When the bits wrap around, (Figure 3.7(c)), a wraps to R and b crosses the midpoint to
L. If the control bit is simply rotated and wrapped around with the paired bits, then a is
passed through to R and b is passed through to L, which is contrary to the originally desired
behavior. If instead the control bit is complemented when it wraps around (Figure 3.7(d)),
then a is swapped back to L and b is swapped back to R, as is desired.
Similarly, if a and b were originally swapped (Figure 3.8(a)), a should be routed to R
and b to L. As the bits rotate (Figure 3.8(b)), we simply rotate the control bit with them.
When the bits wrap around (Figure 3.8(c)), input a wraps to R and b crosses to L. When
they are swapped, a is routed to L and b to R, contrary to their original destinations. If
Figure 3.7: (a) A pair of data bits initially passed through; (b) rotation of the paired data bits and control bit; (c) wrapping of the data bits and control bit; (d) complementation of the control bit.
the control bit is complemented on wraparound, a is passed through to R and b is passed
through to L, conforming to the originally desired behavior.
Thus complementing the control bit when it wraps and changing the behavior of the
paired bits from pass through to swap or vice versa causes each of the pair of bits to stay in
the half to which it was originally routed despite the rotation of the input pushing each bit
to the other half. This limits the rotation of the input to be within each half of the output
and not across the entire output.
We now give a theorem to formalize the overall result:
Theorem 3.7. Any parallel deposit instruction on n bits can be implemented with
one pass through a butterfly network of lg(n) stages without path conflicts (with the
bits that are not selected zeroed out externally).
Proof. We give a proof by construction. Assume there are k “1”s in the right half of the bit mask. Then, based on Fact 3.1, the k rightmost data bits (block X in Figure 3.5) must be kept
in the right half (R) of the butterfly network and the remaining contiguous selected data bits
must be swapped (block Y ) or passed through (block Z) to the left half (L). This can be
Figure 3.8: (a) A pair of data bits initially swapped; (b) rotation of the paired data bits and control bit; (c) wrapping of the data bits and control bit; (d) complementation of the control bit.
accomplished in stage 1 of the butterfly network by setting the k rightmost configuration
bits to “0” (to pass through X) and the remaining configuration bits to “1” (to swap Y ).
At this point, the selected data bits in the right subnetwork (R) are right-aligned but
those in the left subnetwork (L) are contiguous mod n/2, but not right aligned (see Fact 3.4
and Figure 3.5) – they are rotated left by the size of block X or the number of bits kept in
R. We can compensate for the left rotation of the bits in L and determine the control bits
for subsequent stages as if the bits in L were right aligned. This is accomplished by left
rotating and complementing upon wraparound the control bits in the subsequent stages of
L by the number of bits kept in R (once these control bits are determined pretending that
the data bits in L are right aligned). Modifying the control bits in this manner will limit
the rotation to be within each half of the output until the rotation is absorbed after the final
stage (Fact 3.6).
Now the process above can be repeated on the left and right subnets, which are them-
selves butterfly networks: count the number of “1”s in the local right half of the mask and
then keep that many bits in the right half of the subnetwork and swap the remaining selected
data bits to the left half. Account for the rotation of the left half by modifying subsequent
Figure 3.9: Even (or R, dotted) and odd (or L, solid) subnetworks of the inverse butterfly network.
control bits.
This can be repeated for each subnetwork in each subsequent stage until the final stage is
reached, where the final parallel deposit result will have been achieved (e.g., Figure 3.4(c)).
3.2.2 Parallel Extract on the Inverse Butterfly Network
We will now show that pex can be mapped onto the inverse butterfly network. The inverse
butterfly network is decomposed into even and odd subnetworks, in contrast to the butterfly
network which is decomposed into right and left subnetworks. See Figure 3.9, where the
even subnetworks are shown with dotted lines and the odd with solid lines. However, for
simplicity of notation we refer to even as R and odd as L.
Fact 3.8. Any single data bit can be moved to any result position by just moving
it to the correct R or L subnetwork of the intermediate result at every stage of the
inverse butterfly network.
This can be proved by induction on the number of stages. At stage 1, the data bit is
moved to its final position mod 2 (i.e., to R or L). At stage 2, it is moved to its final position
mod 4, and so on. At stage lg(n), it is moved to its final position mod 2^lg(n) = n, which is
its final result position.
Fact 3.9. A permutation is routable on an inverse butterfly network if the destina-
tions of the bits constitute a complete set of residues mod m (i.e., the destinations
equal 0, 1, . . . ,m− 1 mod m) for each subnetwork of width m.
Based on Fact 3.8, bits are routed on the inverse butterfly network by moving them to the
correct position mod 2 after the first stage, mod 4 after the second stage, etc. Consequently,
if the two bits entering stage 1, (with 2-bit wide inverse butterfly networks as shown in
Figure 3.9), have destinations equal to 0 and 1 mod 2 (i.e., one is going to R and one to L),
Fact 3.8 can be satisfied for each bit and they are routable through stage 1 without conflict.
Subsequently, the four bits entering stage 2 (with the 4-bit wide butterfly networks) must
have destinations equal to 0, 1, 2 and 3 mod 4 to satisfy Fact 3.8 for each bit and be routable
through stage 2 without conflict. A similar constraint exists for each stage.
Theorem 3.10. Any Parallel Extract (pex) instruction on n bits can be imple-
mented with one pass through an inverse butterfly network of lg(n) stages without
path conflicts (with the un-selected bits on the left zeroed out).
Proof. The pex operation compresses bits in their original order into adjacent bits in the
result. Consequently, two adjacent selected data bits that enter the same stage 1 subnetwork
must be adjacent in the output – one bit has a destination equal to 0 mod 2 and the other has
a destination equal to 1 mod 2. Thus the destinations constitute a complete set of residues
mod 2 and thus are routable through stage 1. The selected data bits that enter the same
stage 2 subnetwork must be adjacent in the output and thus form a set of residues mod 4
and are routable through stage 2. A similar situation exists for the subsequent stages, up to
the final n-bit wide stage. No matter what the bit mask of the overall pex operation is, the
selected data bits will be adjacent in the final result. Thus the destination of the selected
data bits will form a set of residues mod n and the bits will be routable through all lg(n)
stages of the inverse butterfly network.
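The residue condition underlying this proof can be checked directly. The sketch below (our own code, with a hypothetical function name) groups the selected source positions by the contiguous width-2^s subnetworks of each stage and verifies that their pex destinations are distinct mod 2^s (a subset of a complete residue system; the zeroed, unselected bits are don't-cares):

```python
def pex_routable(mask: str) -> bool:
    """Check Theorem 3.10's routability condition: for every stage s, the
    destinations of selected bits sharing a width-2^s subnetwork of the
    inverse butterfly network are distinct mod 2^s."""
    n = len(mask)
    # bit positions counted from the right; the j-th selected bit
    # (counting from the right) is compressed to position j
    sources = [i for i in range(n) if mask[n - 1 - i] == "1"]   # ascending
    dest = {s: j for j, s in enumerate(sources)}
    m = 2
    while m <= n:
        groups = {}
        for s in sources:
            groups.setdefault(s // m, []).append(dest[s] % m)
        if any(len(set(g)) != len(g) for g in groups.values()):
            return False
        m *= 2
    return True
```

Running this over every 8-bit mask confirms that pex never produces a path conflict on the inverse butterfly network, as Theorem 3.10 asserts.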
3.2.3 Need for Two Datapaths
It would be convenient if both pex and pdep could be implemented using the same datapath circuit. Unfortunately, this is not possible. There is one unique path between any input position and output position in the butterfly or inverse butterfly network (see Fact 3.1 and Fact 3.8 – once a bit has been moved to the correct right or left subnetwork, subsequent stages cannot route bits back to the other subnetwork). Consequently, we need only show one counterexample where paths conflict to prove that pex on butterfly and pdep on inverse butterfly are not possible in the general case.
We first consider trying to implement pex using a butterfly circuit. From Fact 3.1
we see that parallel extract cannot be mapped to the butterfly network. Parallel extract
compresses bits and thus it is easy to encounter scenarios where two bits entering the same
switch would both require the same output in order to be moved to the correct half (L or R)
corresponding to their final destinations. Consider the parallel extract operation shown in
Figure 3.10. In order to move both bits d and h to the correct half of their final positions,
both must be output in the right half after stage 1. This clearly is a conflict and thus parallel
extract cannot be mapped to butterfly.
We now consider implementing pdep using an inverse butterfly circuit. From Fact 3.8
we see that parallel deposit cannot be mapped to the inverse butterfly network. Parallel
deposit scatters right-aligned bits to the left, and thus it is easy to encounter scenarios
where two bits entering the same switch would both require moving to the same left, or
right, subnetwork. Consider the parallel deposit operation shown in Figure 3.11. In order
to move both bits g and h to their final positions mod 2, both must be output in the right
subnet (i.e., even network: 0 mod 2) after stage 1. This clearly is a conflict and thus parallel
Figure 3.10: Conflicting paths when trying to map pex to butterfly.
deposit cannot be mapped to the inverse butterfly datapath.
3.2.4 Decoding the Bitmask into Butterfly Control Bits
The steps in the proof of Theorem 3.7 give an outline for how to decode the n-bit bitmask
into controls for each stage of a butterfly datapath. For each right half of a stage of a
subnetwork we count the number of “1”s in that local right half of the mask, say k “1”s,
and then set the k rightmost control bits to “0”s and the remaining bits to “1”s. This serves
to keep block X in the local R half and export Y to the local L half (refer to Figures 3.3
and 3.5 for nomenclature). We then assume that we explicitly rotate Y and Z to be the
rightmost bits in order in the local L half and iterate through the stages to come up with an
initial set of control bits. After this, we eliminate the need for explicit rotations of Y and Z
by modifying the control bits. This is accomplished by a left rotate and complement upon
wrap around (LROTC) operation, rotating the control bits by the same amount obtained
when assuming explicit rotations.
We will now simplify this process considerably. First, note that when we modify control
Figure 3.11: Conflicting paths when trying to map pdep to inverse butterfly.
bits to compensate for a rotation in a given stage, we do so by propagating the rotation
through all the subsequent stages. This means that when the control bits of a local L are
modified, they are rotated and complemented upon wrap around by the number of “1”s in
the local R, and by the number of “1”s in the local R of the preceding stage, and by the
number of “1”s in all the local R’s of all preceding stages up to the R in the first stage. In
other words, the control bits of the local L are rotated by the total number of “1”s to its
right in the bitmask.
Consider the example of Figure 3.4(b). The control bit in stage 3 in the LL subnetwork
is initially a “1” when we assumed explicit rotations. We first rotated and complemented
this bit by 3, the number of “1”s in R of the bitmask: 1 → 0 → 1 → 0 (Figure 3.6). We
then rotated and complemented this bit by another 1 position, the number of “1”s in LR of
the bitmask: 0 → 1. This yielded the final control bit in Figure 3.4(c). Overall we rotated
this bit by 4, the total number of “1”s to the right of LL or to the right of position 6. This is
a population count (POPCNT) of the bitstring from the rightmost bit to bit position 6.
Second, we need to produce a string of k “0”s from a count (in binary) of k, to derive
Figure 3.12: LROTC(“1111”, 3) = “1000”.
the initial control bits assuming explicit rotations. This can also be done with a LROTC
operation. We start with a string of all “1”s of the correct length and then, for every position
in the rotation, we wrap around a “1” from the left and complement it to get a “0” on the
right. The end result, after a LROTC by k, is a string of the correct length with k rightmost
bits set to “0” and the rest set to “1” (Figure 3.12, where k = 3).
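A minimal sketch of LROTC in Python (our own model, not the hardware) makes this behavior concrete:

```python
def lrotc(a: str, rot: int) -> str:
    """Left rotate and complement on wrap: on each step the bit that wraps
    from the left end to the right end is complemented (Figure 3.12)."""
    for _ in range(rot):
        a = a[1:] + ("0" if a[0] == "1" else "1")
    return a
```

For example, lrotc("1111", 3) gives "1000" as in Figure 3.12, and the period of a k-bit LROTC is 2k, since a full trip around the string complements every bit.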
We can now combine these two facts: the initial control bits are obtained by a LROTC of a string of “1”s the length of the local R by the POPCNT of the bits in the mask in the local R and all bits to the right of it. We denote a string of k “1”s as 1^k. We specify a bitfield from bit h to bit v as {h:v}, where v is to the right of h. So,
• for stage 1, we calculate the control bits as
LROTC(1^(n/2), POPCNT(mask{n/2 − 1:0})),
• for stage 2, we calculate the control bits as
LROTC(1^(n/4), POPCNT(mask{3n/4 − 1:0})) for L and
LROTC(1^(n/4), POPCNT(mask{n/4 − 1:0})) for R,
• for stage 3, we calculate the control bits as
LROTC(1^(n/8), POPCNT(mask{7n/8 − 1:0})) for LL,
LROTC(1^(n/8), POPCNT(mask{5n/8 − 1:0})) for LR,
LROTC(1^(n/8), POPCNT(mask{3n/8 − 1:0})) for RL, and
LROTC(1^(n/8), POPCNT(mask{n/8 − 1:0})) for RR,
and so on for the later stages.
Let us verify that this is correct using the example of Figure 3.4(c):
• for stage 2, the control bits are
LROTC(1^2, POPCNT(“101101”)) = LROTC(“11”, 4) = “11” for L and
LROTC(1^2, POPCNT(“01”)) = LROTC(“11”, 1) = “10” for R,
• for stage 3, the control bit is
LROTC(1^1, POPCNT(“0101101”)) = LROTC(“1”, 4) = “1” for LL,
LROTC(1^1, POPCNT(“01101”)) = LROTC(“1”, 3) = “0” for LR,
LROTC(1^1, POPCNT(“101”)) = LROTC(“1”, 2) = “1” for RL, and
LROTC(1^1, POPCNT(“1”)) = LROTC(“1”, 1) = “0” for RR.
This agrees with the result shown in Figure 3.4(c).
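The same check can be scripted; this short sketch (our own, treating the mask as a string with mask{h:0} its h+1 rightmost characters) evaluates the stage formulas for mask = 10101101:

```python
def lrotc(a, rot):
    # left rotate and complement on wrap
    for _ in range(rot):
        a = a[1:] + ("0" if a[0] == "1" else "1")
    return a

mask = "10101101"                      # the example of Figure 3.4
pc = lambda bits: bits.count("1")      # POPCNT on a bitfield string

stage2_L = lrotc("11", pc(mask[-6:]))  # LROTC(1^2, POPCNT(mask{5:0}))
stage2_R = lrotc("11", pc(mask[-2:]))  # LROTC(1^2, POPCNT(mask{1:0}))
# LL, LR, RL, RR: LROTC(1^1, POPCNT(mask{b:0})) for b = 6, 4, 2, 0
stage3 = [lrotc("1", pc(mask[-(b + 1):])) for b in (6, 4, 2, 0)]
```

Evaluating this reproduces the values listed above: "11" and "10" for stage 2, and 1, 0, 1, 0 for stage 3.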
Overall, the population counts of the bits from position 0 to position k, for k = 0 to n − 2, are required. We call this the set of prefix population counts. One interesting observation is that for stage 1 we need one count of n/2^1 bits, for stage 2 we need the two counts of the odd multiples of n/2^2 bits, for stage 3 we need the four counts of the odd multiples of n/2^3 bits and so on.
Using the two new functions we have defined, LROTC and POPCNT, we present an
algorithm (Algorithm 3.11, Figure 3.13) to decode the n mask bits into the n × lg(n)/2
control bits for pdep (and for pex). Algorithm 3.11 summarizes the discussion above for
obtaining the control bits for achieving pdep on a butterfly circuit.
We note that Algorithm 3.11 can also be used to obtain the control bits for the inverse
butterfly for a pex operation, with the one caveat that the controls for stage i of the butterfly
datapath are routed to stage lg(n)−i+1 in the inverse butterfly datapath. This can be shown
using an approach similar to that for pdep, except for working backwards from the final
Algorithm 3.11. To generate the n × lg(n)/2 butterfly or inverse butterfly control bits from the n-bit mask.

Input: mask; the bitmask
Output: bcb; the lg(n) × n/2 matrix containing the butterfly control bits
        ibcb; the lg(n) × n/2 matrix containing the inverse butterfly control bits

Let x||y indicate the concatenation of bit patterns x and y. POPCNT(a) is the population count of “1”s in bitfield a. mask{h:v} is the bitfield from bit h to bit v of the mask. 1^k indicates a bit-string of k ones. LROTC(a, rot) is a “left rotate and complement on wrap” operation, where a is the input and rot is the rotation amount.

1. Calculate the prefix population counts:
   For i = 0, 1, . . ., n − 2:
       pc[i] = POPCNT(mask{i:0})

2. Calculate the butterfly (and inverse butterfly) control bits for each stage by performing LROTC(1^k, pc[m]), where k is the size of the local R and m ranges over the set of the leftmost bit positions of the local R’s:
   bcb = ibcb = {}
   For i = 1, . . ., lg(n):               // for each stage
       k = n/2^i                          // number of bits in local R
       For j = 1, 3, 5, . . ., 2^i − 1:   // for each local R
           m = j × k − 1                  // the leftmost bit position of the local R
           temp = LROTC(1^k, pc[m])
           bcb[i] = temp || bcb[i]
           ibcb[lg(n) − i + 1] = temp || ibcb[lg(n) − i + 1]

Figure 3.13: Algorithm for decoding a bitmask into controls for a butterfly (or inverse butterfly) datapath.
stage. The derivation is given in more detail in an appendix to this chapter (Section 3.5).
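As a cross-check, the following Python model (our own, not from the hardware design) implements Algorithm 3.11 and behavioral butterfly and inverse butterfly datapaths, then verifies both pdep and pex for every 8-bit mask. We assume stage s of the butterfly swaps pairs at distance n/2^s and stage s of the inverse butterfly at distance 2^(s−1), with unselected bits zeroed externally as the theorems require:

```python
def lrotc(a, rot):
    # left rotate and complement on wrap
    for _ in range(rot):
        a = a[1:] + ("0" if a[0] == "1" else "1")
    return a

def decode(mask):
    """Algorithm 3.11: mask -> per-stage control strings bcb, ibcb."""
    n = len(mask)
    lgn = n.bit_length() - 1
    pc = [mask[n - 1 - i:].count("1") for i in range(n)]  # prefix popcounts
    bcb = [""] * (lgn + 1)
    ibcb = [""] * (lgn + 1)
    for i in range(1, lgn + 1):
        k = n >> i                              # size of each local R
        for j in range(1, 2 ** i, 2):           # each local R, right to left
            temp = lrotc("1" * k, pc[j * k - 1])
            bcb[i] = temp + bcb[i]
            ibcb[lgn - i + 1] = temp + ibcb[lgn - i + 1]
    return bcb, ibcb

def run(data, ctrls, dists):
    """Apply stages of paired swaps; dists[s] is the pair distance of stage s."""
    out = list(data)
    for ctrl, d in zip(ctrls, dists):
        c = iter(ctrl)
        for base in range(0, len(out), 2 * d):  # each contiguous block
            for p in range(base, base + d):
                if next(c) == "1":
                    out[p], out[p + d] = out[p + d], out[p]
    return "".join(out)

def pdep_ref(data, mask):
    k = mask.count("1")
    sel = iter(data[len(data) - k:])
    return "".join(next(sel) if m == "1" else "0" for m in mask)

def pex_ref(data, mask):
    picked = "".join(d for d, m in zip(data, mask) if m == "1")
    return picked.rjust(len(data), "0")

n, lgn, data = 8, 3, "abcdefgh"
for v in range(2 ** n):
    mask = format(v, "08b")
    bcb, ibcb = decode(mask)
    k = mask.count("1")
    bfly_in = "0" * (n - k) + data[n - k:]      # selected bits right-aligned
    assert run(bfly_in, bcb[1:], [n >> s for s in range(1, lgn + 1)]) \
        == pdep_ref(data, mask)
    ibfly_in = "".join(d if m == "1" else "0" for d, m in zip(data, mask))
    assert run(ibfly_in, ibcb[1:], [1 << (s - 1) for s in range(1, lgn + 1)]) \
        == pex_ref(data, mask)
```

For the running example the decoder produces stage controls "1000", "1110" and "1010", matching the values derived for Figure 3.4(c).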
3.2.5 Hardware Decoder
The execution time of Algorithm 3.11 in software is approximately 675 cycles on an Intel
Pentium-D processor. This software routine is useful only for pex or pdep operations for
which the bit mask is known ahead of time. However, for dynamic masks, we require a
hardware decoder that implements Algorithm 3.11 in order to achieve a high performance.
Fortunately, Algorithm 3.11 just contains two basic operations, population count and left-
rotate-and-complement, both of which have straightforward hardware implementations. A
block diagram of the decoder is shown in Figure 3.14.
Figure 3.14: Hardware decoder for dynamic pdep and pex operation (for pex, the order of the outputs is reversed).
Before describing the decoder in detail, we make note of a simplification based on the properties of rotations – that they are invariant when the rotation amount differs by the period of rotation, which for a k-bit LROTC is 2k. Thus, for the ith stage of the butterfly network, k = n/2^i (see Algorithm 3.11) and the POPCNTs are only computed mod n/2^(i−1).
For example, for the 64-bit hardware decoder, the 32 POPCNTs of butterfly stage 6, corresponding to the odd multiples of n/64, need only be computed mod 2 – only the least significant bit; the 16 POPCNTs of butterfly stage 5 need only be computed mod 4 – the two least significant bits; and so on. Only the POPCNT of 32 bits
for stage 1 requires the full lg(n)-bit POPCNT. In total, 120 bits are computed instead of
the 378 bits that would have been computed had the full lg(n)-bit POPCNTs been required
for each position.
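The savings can be tallied directly; this small sketch (our own arithmetic check) computes the POPCNT widths per stage for n = 64:

```python
# For n = 64 (lg(n) = 6): stage i has 2^(i-1) local R's, and its POPCNTs
# need only lg(n) - i + 1 bits, since they are taken mod n/2^(i-1).
n, lgn = 64, 6
reduced = sum(2 ** (i - 1) * (lgn - i + 1) for i in range(1, lgn + 1))
full = (n - 1) * lgn     # full 6-bit counts at every prefix position
print(reduced, full)     # prints: 120 378
```

This reproduces the totals stated above: 120 computed bits instead of 378.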
The first stage of the decoder is a parallel prefix population counter. This is a circuit
that computes in parallel all the population counts of step 1 of Algorithm 3.11. In designing
this circuit we considered population counter architectures and parallel prefix networks for
Figure 3.15: (a) Count of 8 bits; (b) Count of 24 bits; (c) Count of 32 bits; (d) Count of 56 bits
adders.
The carry shower counter [75] groups the input into sets of three lines which are input
into full adders. The sum and carry outputs of the full adders are each grouped into sets of
three lines which are input to another stage of full adders and so on.
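The carry shower idea can be sketched behaviorally (our own Python model with hypothetical names; the real circuit is fixed-structure combinational logic, not this iterative loop): lines of equal weight are fed in groups of up to three into full adders until one line per weight remains.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry) for three input lines."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def carry_shower_popcount(bits):
    """Population count via carry shower counting: repeatedly reduce the
    lines at each weight through full adders, carries moving up one weight."""
    cols = {0: list(bits)}                # weight -> lines of that weight
    while any(len(v) > 1 for v in cols.values()):
        nxt = {}
        for w, lines in sorted(cols.items()):
            while len(lines) >= 2:
                a, b = lines.pop(), lines.pop()
                c = lines.pop() if lines else 0   # pad a pair with a zero line
                s, cy = full_adder(a, b, c)
                nxt.setdefault(w, []).append(s)
                nxt.setdefault(w + 1, []).append(cy)
            if lines:                             # single leftover line
                nxt.setdefault(w, []).append(lines.pop())
        cols = nxt
    return sum(v[0] << w for w, v in cols.items())
```

The final single line per weight is the binary count; in the decoder the last addition is instead completed by a carry-propagate adder on the carry-save form.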
In our circuit, in the first stage we group 8 bits into 3 sets due to the structure of the
butterfly and inverse butterfly networks, i.e., the subnetworks have power of 2 sizes. In
stage 2, we add three sums and three carries to produce the sum of the 8 bits in carry-save
form. Figure 3.15(a) shows the first two stages of the counter. The output of this stage is a
2-bit sum and 2-bit carry.
We then gather 3 sets of sums and carries in stage 3 and produce the sum of 24 bits
in carry-save form (this requires a second full adder delay within the stage as shown in
Figure 3.15(b)). In stage 4, we gather up to three intermediate results and reduce these
to carry-save form. Stage 4 for the 32-bit count, which requires the full lg(n) bits, is
shown in Figure 3.15(c) and for the 56-bit count, which is computed mod 16, is shown
in Figure 3.15(d). We then use a carry-propagate adder to complete the addition. The
complete networks for the 32-bit, 56-bit and 63-bit counts are shown in Figure 3.16.
The parallel prefix architecture resembles radix-3 Han-Carlson [76], which is a parallel
prefix look-ahead carry adder that has lg(n) + 1 stages with carries propagated to the odd
positions in the extra final stage to limit fanout at the earlier levels. In our circuit, we
defer the 1- and 2-bit counts to the end, similar to odd carries being deferred in the Han-
Carlson adder, as these are easy to compute from the other counts. This can be seen in
Figure 3.16(c). The radix-3 nature stems from the carry shower counter design, as we
group 3 lines to input into a full adder at each level. Thus, the counter has ⌈log_3(64)⌉ + 2 = 6
stages (where each stage can have multiple full adder delays). We note that in order to
efficiently compute the 3-bit and greater counts, the first two stages are overlapped every 4
positions. This can be seen in Figure 3.17(a) which shows the diagram of the entire parallel
prefix network. The numbers at the top are bit positions. The numbers at the bottom
indicate the number of bits in the sum. The trapezoids indicate adder blocks such as those
in Figure 3.15. The squares indicate the final carry-propagate adder, which can be as small
as 1-bit.
The parallel prefix population count circuit computes the population count of the low 32
bits, the low 16 bits and the low 8 bits (and the low 4 bits and the low 2 bits). With a minor
extension, the population count of the full 64 bit word can be computed (Figure 3.17(b)).
Similarly, the population count of each byte is computed in carry-save form and with min-
imal logic can be transformed into the full 4-bit count of each byte. By modifying the
network, we can also compute the population count of each 16-bit or 32-bit subword. Thus
the parallel prefix population count circuit can be used as the basis of a population count
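The prefix counts this circuit produces can be modeled functionally. The following Python sketch (ours, a behavioral model rather than the carry-save network) returns the power-of-2 prefix population counts of a 64-bit word in one pass:

```python
def prefix_popcounts(x):
    """Return {width: popcount of the low `width` bits of x} for the
    power-of-2 prefixes used by the pex/pdep decoder."""
    counts = {}
    total = 0
    for i in range(64):
        total += (x >> i) & 1
        if (i + 1) in (2, 4, 8, 16, 32, 64):
            counts[i + 1] = total
    return counts

c = prefix_popcounts(0xFFFF_FFFF_0000_00FF)
assert c[8] == 8 and c[16] == 8 and c[32] == 8 and c[64] == 40
```

The hardware shares intermediate carry-save sums between these prefixes; the sketch only captures the input/output behavior.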
explains why in Table 3.2, the variable circuits (Figures 3.22 and 3.23) have approximately
18-30% longer cycle time latencies compared to the static case (Figure 3.20). They are also
2.5 to 3.7 times larger than the static unit.
3.4 Summary
This chapter introduces the parallel extract instruction to perform bit gather operations
and the parallel deposit instruction to perform bit scatter operations. Datapaths for these
instructions were also discussed. We showed that parallel extract maps to the inverse
butterfly network and parallel deposit maps to the butterfly network.
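For reference, the semantics of the two instructions can be written as a short Python model (our own code, operating on Python integers; the thesis instructions operate on 64-bit registers):

```python
def pex(data, mask, n=64):
    """Bit gather: select the data bits where mask is 1, compress them
    to the right preserving order; high result bits are zeroed."""
    result, out = 0, 0
    for i in range(n):                 # i = 0 is the rightmost bit
        if (mask >> i) & 1:
            result |= ((data >> i) & 1) << out
            out += 1
    return result

def pdep(data, mask, n=64):
    """Bit scatter: deposit the right-justified data bits into the
    positions where mask is 1, preserving order."""
    result, src = 0, 0
    for i in range(n):
        if (mask >> i) & 1:
            result |= ((data >> src) & 1) << i
            src += 1
    return result

assert pex(0b1011_0010, 0b1100_1010) == 0b1001
assert pdep(0b0101, 0b1100_1010) == 0b0100_0010
# pdep inverts pex on the selected bits:
assert pex(pdep(0b0101, 0b1100_1010), 0b1100_1010) == 0b0101
```

The same semantics later appeared in the x86 BMI2 instructions PEXT and PDEP.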
We then presented an algorithm for decoding the pex or pdep bitmask into the con-
trol bits for the datapaths. A hardware implementation of this algorithm was also pre-
sented. This hardware decoder has two basic components – parallel prefix population count
(POPCNT) and left-rotate-and-complement (LROTC) subcircuits.
With this foundation, we presented the design of a standalone advanced bit manip-
ulation functional unit that supports bfly, ibfly, pex and pdep. The high cost of
the hardware decoder that translates the pex or pdep bitmask into the datapath controls
prompted the definition of two variants of the pex and pdep instructions – static versions,
in which the datapath control bits are precomputed using Algorithm 3.11 in software (by
the compiler or programmer), and loop-invariant versions, in which hardware decoding is
performed once and then the static versions of the instructions are used. This led to the
design of two alternative functional units – one that supports only static pex and pdep
operations and one that supports variable pex.v and pdep.v instructions as well as loop
invariant operations. The unit that supports only the static instructions is faster (0.95× the
latency) and smaller (0.9× the area) than an ALU.
The next chapter of the thesis presents a new design for the standard shifter using the
advanced bit manipulation functional unit as its basis.
3.5 Appendix: Decoding the pex Bitmask into Inverse Butterfly Control Bits
We will now show that the inverse butterfly control bits for a pex operation can also be
obtained using Algorithm 3.11. This can be shown using an approach similar to that used
for pdep on butterfly in Section 3.2.
In the final stage of the inverse butterfly network, we call X the selected data bits that
pass through in R, Y the selected data bits that are transferred from L to R and Z the
selected data bits that pass through in L. Consider the possible combinations of X , Y and
Z:
i. X alone exists – when there are no selected data bits in L,
ii. Y alone exists – when there are no selected data bits in R,
Figure 3.24: Final stage of pex operation.
iii. X and Y exist – when no selected data bits that start in L stay in L,
iv. X and Z exist – when all the data bits in R are selected and there are some selected
data bits in L, or
v. X , Y and Z exist.
When X exists (i, iii, iv and v), it must be the rightmost bits of R, as the output of pex
is the selected data bits compressed maintaining their relative ordering and right justified
in the result.
When Y exists (ii, iii and v), it must be rotated left from the midpoint by the size of X
at the input to the final stage so that Y and X will be contiguous at the output of the final
stage after Y is swapped from L to R.
When Z exists (iv and v), it must be the rightmost bits in L so that when it is passed
through in L in the final stage it is contiguous with the bits in R.
When Y and Z exist (v), Y must additionally be the leftmost bits of L so that when
it is swapped into R in the final stage it is contiguous to Z on its left. Consequently, Y
and Z must be contiguous mod n/2 in L. Additionally, we can view Y and Z as being the
rightmost bits of L at the output of stage lg(n) − 1, which are then left rotated by |X| prior
to the input to stage lg(n). See Figure 3.24.
Thus, prior to the final stage we have two half-sized inverse butterfly networks, L and R,
with a pex operation performed within each half network (first row of Figure 3.24 where
selected data bits ZY are the right justified and compressed selected data bits in L and X is
the right justified and compressed selected bits in R). We then iterate backwards through the
stages, performing a local pex operation within each half subnetwork. Between the stages
we explicitly left rotate each local L by the number of selected data bits in the local R.
We use Fact 3.8 to route the bits in the local pex operation. This process is illustrated for
an example pex operation in Figure 3.25(a), broken down into steps in Figure 3.25(b).
Figure 3.25: (a) 8-bit pex operation (b) mapped onto inverse butterfly network with explicit left rotations between stages and (c) without explicit rotations of data bits by modifying the control bits.
In the first stage in Figure 3.25(b), we perform a local pex within each 2-bit subnet-
work:
• a belongs in position 6 for the local 2-bit pex, so it is swapped from L to R to yield
0a;
• c belongs in position 4, so it is swapped to yield 0c;
• e and f are in the correct positions (3 and 2, respectively), so they are passed through
to yield ef;
• h is in position 0, so it is passed through to yield 0h.
Then, prior to stage 2, we explicitly left rotate each local L of stage 2 by the number of “1”
bits in the corresponding local R of stage 2: 0a is left rotated by 1 to a0, as there is one
selected data bit (c) in the corresponding local R, and ef is also left rotated by 1 to fe, as
there also is one selected data bit (h) in the corresponding local R.
In stage 2, we perform a local pex within each 4-bit subnetwork:
• a belongs in position 5 and c in position 4 for the local 4-bit pex, so a is swapped
to yield 00ac.
• e belongs in position 2, f in position 1 and h in position 0, so f is swapped to yield
0efh.
Prior to stage 3, we explicitly left rotate L of stage 3, 00ac, by 3, the number of selected
bits in R of stage 3 (efh), to c00a.
In the final stage, c belongs in position 3 so it is swapped from L to R to yield the
overall output 000acefh.
Rather than explicitly left rotating L after each stage, we can incorporate the rotations
directly into the routing of the pex operation by modifying the control bits. This is based
on a property of inverse butterfly networks: given a permutation performed by the in-
verse butterfly network, a rotation of that permutation can also be achieved by rotating and
complementing the control bits by the same amount. (This can be shown using an argument
similar to the one used for Fact 3.6, the analogous fact for butterfly networks.)
Stage i of the inverse butterfly network can add 2i positions to the rotation, i.e., at
the output of the first stage the bits can be rotated by 0 or 1 positions from the initial
permutation. In the second stage, the bits can be rotated an additional 0 or 2 positions, or
the bits can rotated from the inputs by 0, 1, 2 or 3 positions and so on. Thus, we can rotate
the output of any stage of a subnetwork starting from any input and any initial permutation.
Consider the example of Figure 3.25(b), where we wish to eliminate the explicit data
rotations in the local L halves by rotating the control bits instead. To add a left rotation by
1 of 0a and ef through stage 1, we (left rotate and) complement the control bit for those
two subnetworks by 1 and we get the desired result (a0 and fe) as shown in Figure 3.26.
Then, to add a left rotation by 3 in L through stage 2, we left rotate and complement the
control bits by 3 in the first 2 stages of L. The first stage control bits are complemented 3
times in succession: 0 → 1 → 0 → 1 and 1 → 0 → 1 → 0 and the second stage control
bits are left rotated and complemented by 3 positions: 10→ 00→ 01→ 11. The result is
shown in Figure 3.25(c), where we see no explicit rotations, yet nevertheless the input of L
in stage 3 is still c00a and the overall output is still 000acefh.
Hence, the derivation of the control bits for the inverse butterfly network for a pex
operation is in fact identical to the derivation of the control bits for the pdep on the butterfly
network. At each stage, starting from stage 1, in each subnetwork, we pass through the k
rightmost bits by setting the k rightmost control bits to “0”, if there are k “1”s in the local
R of the mask. This is generated by LROTC(“1 . . . 1”, k). We then left rotate the output
by the number of “1”s in the next local R by left rotating and complementing the control
bits from stage 1 through the current stage by that amount. When we add up the effect of
rotating the outputs of each stage – that each rotation requires rotating the control bits from
stage 1 through that stage – the net result is a LROTC of the control bits by the total number
of “1”s to the right of the subnetwork.

Figure 3.26: 8-bit pex operation with explicit left rotation after stage 1 eliminated.
So we arrive at the same result: the initial control bits are obtained by a LROTC of a
string of all “1”s the length of the local R by the number of “1”s in the local R of the mask
and are then corrected by a further LROTC by the total number of “1”s to the right in the
mask. These can be combined into a single LROTC of the all one string by the total number
of “1”s in the local R and to the right. Thus Algorithm 3.11 also yields the configuration
bits for mapping pex to inverse butterfly, with the one caveat that the controls for stage i
of the butterfly datapath are equivalent to those for stage lg(n) − i + 1 in the inverse butterfly datapath
(as indicated in the final line of Algorithm 3.11).
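As a sanity check on this derivation, the combined rule – the control bits of each subnetwork are a LROTC of an all-ones string by the popcount of the mask from the subnetwork's midpoint rightward – can be tested in software. The following Python sketch (our own modeling; bit strings are written left to right as in the figures) decodes a mask, routes data through a 3-stage inverse butterfly, and compares the right-justified result bits against the selected bits:

```python
def lrotc(bits, s):
    # left rotate with complement-on-wrap
    for _ in range(s):
        bits = bits[1:] + ('0' if bits[0] == '1' else '1')
    return bits

def ibfly_pex(data, mask):
    """Route `data` (string, leftmost char = MSB) through an inverse
    butterfly configured for pex of `mask` (string of '0'/'1')."""
    n = len(data)
    state = list(data)
    half = 1                       # pair distance at the current stage
    while half < n:
        size = 2 * half            # subnetwork width at this stage
        for base in range(0, n, size):
            # mask popcount from the subnetwork midpoint to word end:
            # local R ones plus all ones to the right of the subnetwork
            s = mask[base + half:].count('1')
            ctrl = lrotc('1' * half, s)
            for p in range(half):  # ctrl[p] controls pair p, left to right
                if ctrl[p] == '1':
                    a, b = base + p, base + half + p
                    state[a], state[b] = state[b], state[a]
        half = size
    return ''.join(state)

data, mask = 'abcdefgh', '10101101'     # the example of Figure 3.25
out = ibfly_pex(data, mask)
ref = ''.join(d for d, m in zip(data, mask) if m == '1')
assert out[-mask.count('1'):] == ref    # low bits hold 'acefh'
```

Non-selected bits land in the high positions (they are zeroed by separate logic in the real unit); only the right-justified selected bits are checked.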
Chapter 4
Advanced Bit Manipulation Functional
Unit as a Basis for Shifters
In the previous chapter, we described a standalone advanced bit manipulation functional
unit that supports bit permutation, bit gather and bit scatter operations. In this chapter,
we propose using the advanced bit manipulation functional unit as a new basis for shifters,
rather than simply adding it to a processor core, or enhancing the current shifter to also
support the above advanced bit manipulation functions. We propose replacing two existing
functional units (the shifter and the multimedia-mix functional units in Itanium processors)
with the new permutation functional unit and performing all the operations previously done
on these existing shifters as well as the advanced bit permutation, bit gather and bit scatter
operations on the same datapath.
This chapter is organized as follows. Section 4.1 lists the basic shifter instructions.
Section 4.2 then shows that the advanced bit manipulation functional unit can perform
any of these basic operations. The hard part is determining how the controls of the lg(n)
stages of the circuit should be set, and we give definitive algorithms for this. Section 4.3
gives a detailed description of the complete new shift-permute functional unit. Section 4.4
discusses how this new functional unit can be considered an evolution of the design of
shifters and compares the new design to the classic log shifter. Section 4.5 summarizes this
chapter.
4.1 Basic Shifter Operations
The basic shifter instructions consist of two groups and are summarized in Table 4.1. The
first group consists of the shift and rotate instructions supported in essentially all micropro-
cessors. These instructions include right or left shifts (with zero or sign propagation), and
right or left rotates. While a few microprocessors support only shifts but not rotates, we will
consider rotate as a basic supported operation. The second group of instructions include
extract and deposit instructions and mix instructions, all introduced in Chapter 2. No ISA
currently supports mix for subwords smaller than a byte. In our proposed new functional
unit, mix for bits – and for all subword sizes that are powers of 2 – is supported. This
includes 12 mix operations: mix.left and mix.right for each of the 6 subword sizes
2^0, 2^1, 2^2, 2^3, 2^4 and 2^5 bits, for a 64-bit datapath processor.
4.2 Basic Shifter Operations on the Inverse Butterfly Datapath
The key conceptual insight comes from recognizing that the set of basic shifter and mix
operations in Table 4.1 is based on minor variations on a rotation operation, and that any
rotation operation can be achieved on an inverse butterfly circuit (or on a butterfly circuit).
Theorem 4.1. An inverse butterfly circuit can achieve any rotation of its input.
This is a well-known property of inverse butterfly circuits. A proof of this can be found
in [78], where rotations are called cyclic shifts.
Table 4.1: Standard shifter operations

rotr r1 = r2, shamt
rotr r1 = r2, r3
    Right rotate of r2. The rotate amount is either an immediate in the opcode (shamt), or specified using a second source register r3.

rotl r1 = r2, shamt
rotl r1 = r2, r3
    Left rotate of r2.

shr r1 = r2, shamt
shr r1 = r2, r3
    Right shift of r2 with vacated positions on the left filled with the sign bit (most significant bit of r2) or zero-filled. The shift amount is either an immediate in the opcode or specified using a second source register r3.

shl r1 = r2, r3
    Left shift of r2 with vacated positions on the right zero-filled.

extr r1 = r2, pos, len
extr.u r1 = r2, pos, len
    Extraction and right justification of a single field from r2 of length len from position pos. The high order bits are filled with the sign bit of the extracted field (extr) or zero-filled (extr.u).

dep.z r1 = r2, pos, len
dep r1 = r2, r3, pos, len
    Deposit at position pos of a single right justified field from r2 of length len. Remaining bits are zero-filled (dep.z) or merged from second source register r3 (dep).

mix{r,l}{0,1,2,3,4,5} r1 = r2, r3
    Select right or left subword from a pair of subwords, alternating between source registers r2 and r3. Subword sizes are 2^i bits, for i = 0, 1, 2, . . ., 5, for a 64-bit processor.
Corollary 4.2. An enhanced inverse butterfly circuit can perform on its input:
a. Right and left shifts
b. Extract operations
c. Deposit operations
d. Mix operations
Proof. This follows from Theorem 4.1, with these operations modeled as a rotate with
additional logic handling zeroing, or sign extension from an arbitrary position, or merging
bits from the second source operand. Mix is modeled as a rotate of one operand by the
subword size and then a merge of subwords alternating between the two operands.
As the inverse butterfly circuit only performs permutations without zeroing and without
replication, the circuit must be enhanced with an extra 2:1 multiplexer stage at the end that
either selects the rotated bits “as is” or other bits which are precomputed as either zero, or
the sign bit (replicated), or the bits of the second source operand, depending on the desired
operation.
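The decomposition of Corollary 4.2 – a rotate followed by a final 2:1 select stage – can be sketched functionally. In this Python model (our decomposition mirroring the corollary, not the gate-level unit; an 8-bit word is used for brevity), every operation is a rotation plus a per-bit choice between the rotated word and precomputed fill bits:

```python
N = 8                      # small word size for illustration; the unit is 64-bit
MASK = (1 << N) - 1

def rotr(x, s):
    s %= N
    return ((x >> s) | (x << (N - s))) & MASK

def select(rotated, fill, keep):
    """Final 2:1 multiplexer stage: rotated bits where keep is 1,
    precomputed fill bits (zero, sign, or second operand) elsewhere."""
    return (rotated & keep) | (fill & ~keep & MASK)

def shr(x, s):                    # logical right shift = rotate + zero high bits
    return select(rotr(x, s), 0, (1 << (N - s)) - 1)

def extr_u(x, pos, length):       # unsigned extract (assumes pos + length <= N)
    return select(rotr(x, pos), 0, (1 << length) - 1)

def dep_z(x, pos, length):        # deposit-and-zero = left rotate + field mask
    return select(rotr(x, N - pos), 0, ((1 << length) - 1) << pos)

def mix_r(x, y, k):
    """mix.right for subword size k: in each 2k block, the right
    subwords of x and y, with x's placed on the left."""
    left = 0
    for base in range(0, N, 2 * k):
        left |= ((1 << k) - 1) << (base + k)
    return select(rotr(x, N - k), y, left)   # rotl(x, k) = rotr(x, N - k)

assert shr(0b1011_0011, 3) == 0b0001_0110
assert extr_u(0b1011_0011, 4, 3) == 0b011
assert dep_z(0b101, 2, 3) == 0b1_0100
assert mix_r(0b1101_0110, 0b0011_1010, 2) == 0b0111_1010
```

Sign-extending variants would use a replicated sign bit as the fill value, and dep would use the second operand, exactly as the multiplexer stage described above.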
Corollary 4.3. Theorem 4.1 and Corollary 4.2 are true for the butterfly network as
well.
Proof. The butterfly and inverse butterfly networks exhibit a reverse symmetry of their
stages from input to output. Thus a rotation on the inverse butterfly network is equivalent
to a rotation in the opposite direction on the butterfly network when the flow through the
network is reversed (see Figure 4.1). Hence, a butterfly circuit can also achieve any rotation
of its inputs. As in Corollary 4.2, a butterfly network enhanced with an extra multiplexer
stage at the end is needed to handle zeroing or sign extension or merging bits from the
second source operand.
Figure 4.1: Left rotate by three on inverse butterfly is equivalent to right rotate by three on butterfly.
We first show how control bits are obtained for rotations on an inverse butterfly circuit
in Section 4.2.1, then for the other operations in Section 4.2.2.
4.2.1 Determining the Control Bits for Rotations
To achieve a right (or left) rotation by s positions, for s = 0, 1, 2, . . ., n− 1, using the n-bit
wide inverse butterfly circuit with lg(n) stages, the input must be right (or left) rotated by
s mod 2j within each 2j-bit wide inverse butterfly circuit at each stage j. This is because
from stage j + 1 on, the inverse butterfly circuit can only move bits at granularities larger
than 2j positions (so the finer movements must have already been performed in the prior
stages). We first give a conceptual explanation of this, then a formal constructive proof to
obtain the actual control bits for a rotation.
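The invariant – after stage j, every 2^j-bit subnetwork holds its slice of the input right-rotated by s mod 2^j – directly determines the control bits: each stage's controls are whichever pass/swap choices carry one invariant state to the next. A small Python sketch (our constructive check, not the thesis's generator circuit) derives and verifies them for n = 8:

```python
def local_rotr(bits, s):
    s %= len(bits)
    return bits[-s:] + bits[:-s] if s else bits

def rotate_via_ibfly(n, s, data):
    """Check inverse butterfly controls for a right rotation by s using
    the invariant that after stage j every 2**j-bit block holds its
    input slice right-rotated by s mod 2**j."""
    def state(j):                    # invariant state after stage j
        size = 2 ** j
        return ''.join(local_rotr(data[b:b + size], s)
                       for b in range(0, n, size))
    prev = data
    for j in range(1, n.bit_length()):          # stages 1 .. lg(n)
        size, half = 2 ** j, 2 ** (j - 1)
        nxt = state(j)
        for base in range(0, n, size):
            for p in range(half):
                a, b = base + p, base + half + p
                if nxt[a] != prev[a]:           # control bit = 1: swap pair
                    assert nxt[a] == prev[b] and nxt[b] == prev[a]
                else:                           # control bit = 0: pass through
                    assert nxt[b] == prev[b]
        prev = nxt
    return prev

assert rotate_via_ibfly(8, 3, 'abcdefgh') == 'fghabcde'
```

The inner assertions confirm that each stage transition is achievable purely with pass/swap decisions on pairs 2^(j−1) positions apart, which is exactly what the stage hardware provides.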
An n-bit inverse butterfly circuit can be viewed as two (lg(n) − 1)-stage circuits fol-
lowed by a stage that swaps or passes through paired bits that are n/2 positions apart. To
right rotate the input in_{n−1} … in_0 by s positions, the two (lg(n) − 1)-stage circuits must
have right rotated their half inputs by s′ = s mod n/2 and the input to stage lg(n) must be
standard cell implementation. Compared to the standard cell implementation of Table 4.6,
the FPGA unit that supports variable pex.v and pdep.v is faster and larger relative to
the log shifter, perhaps due to how the addition in the parallel prefix population counter
is synthesized. We note that the ALU is faster and smaller than all the shifter designs,
possibly due to specialized adder cells used in the ALU implementation.
From both standard cell and FPGA implementations we see that supporting variable pex.v
and pdep.v operations has a high cost, similar to the case of the standalone advanced bit
manipulation units in Section 3.3. One may choose to simply pay this cost in order to support
the greatest range of operations. Alternatively, one may simply omit support for variable
operations and use the functional unit with only bfly, ibfly, static pex and pdep,
thereby forcing the use of a software routine to generate the control bits whenever variable
pex.v and pdep.v are performed. As we will see in Chapter 6, this may be a sufficient
alternative.
We remark that full custom designs of the ALU, log shifter and our new shift-permute
unit should be done, since standard cell or FPGA implementations may not reflect a fair
or accurate comparison – especially between the shifters and the ALU, which is typically
highly optimized by custom circuit design. Such circuit design is more appropriately done
by microprocessor custom circuit designers according to implementation-specific needs
and the process technology used.
4.5 Summary
The advanced bit manipulation functional unit can be enhanced to support the operations
usually performed by the standard shifter. We showed that standard shifter operations are
all variations on rotations and then designed a rotation control bit generator circuit, which
produces the control bits for the inverse butterfly (or butterfly) datapath based on the shift
amount. This new design is an evolution of shifter architectures away from the classic
barrel and log shifters.
As a new basis for shifters performing only existing shift, rotate, extract, deposit and
mix instructions, our new ibfly-based design has 1.01× the latency and 0.87× the area of
the log-based shifter. Our new unit that also supports static pex and pdep operations has
1.11× the latency and 1.83× the area of the log shifter (standard cell implementation), but
is much more capable.
In the following chapter, we consider the bit matrix multiply operation which is used
to accelerate parity computation and can also perform a superset of the bit manipulation
operations. In Chapter 6 we analyze the applications that benefit from advanced bit manip-
ulation instructions.
Chapter 5
Bit Matrix Multiply
Another powerful bit manipulation operation is bit matrix multiply (bmm). bmm, which
multiplies two bit matrices contained in processor registers, performs a superset of the
advanced bit manipulation operations discussed in Chapter 3. Furthermore, bmm has many
additional capabilities such as finite field multiplication and parity computation. In this
chapter we discuss how bmm, which has heretofore been supported only by supercomputers
such as Cray [82], can be implemented in a commodity microprocessor.
This chapter is organized as follows. Section 5.1 gives the definition of the bit matrix
multiply operation. Section 5.2 discusses the basic operations that bmm can be used to
perform. Section 5.3 describes methods of computing bmm using existing instruction set
architectures. Section 5.4 proposes new architectural enhancements to accelerate bmm.
Section 5.5 compares the new proposals to the existing techniques. Section 5.6 discusses
related work in bmm implementation. Section 5.7 summarizes this chapter.
5.1 Definition of Bit Matrix Multiply Operation
The bit matrix multiply operation, bmm.n, multiplies two (n × n) bit matrices to produce
a third (n × n) bit matrix. The operation is defined in the first line of Table 5.1. The rows
Table 5.1: Definitions of Bit Matrix Multiply Instructions

bmm.n C = A, B (Bit Matrix Multiply)
    A, B, C: n × n bit matrices; C = A × B mod 2
    for i from 1 to n
      for j from 1 to n
        c_{i,j} = a_{i,1}b_{1,j} ⊕ a_{i,2}b_{2,j} ⊕ . . . ⊕ a_{i,n}b_{n,j}

bmmt.n C = A, B (Bit Matrix Multiply with implicit transpose)
    A, B, C: n × n bit matrices; C = A × B^T mod 2
    for i from 1 to n
      for j from 1 to n
        c_{i,j} = a_{i,1}b_{j,1} ⊕ a_{i,2}b_{j,2} ⊕ . . . ⊕ a_{i,n}b_{j,n}
of a matrix are numbered ascending from top to bottom, starting at 1. Within a row we
follow the normal matrix numbering convention starting at 1 on the left. (When referring
to bit numbering within a register, position n − 1 is on the left and 0 is on the right.) An
alternative formulation which transposes the B matrix (bmmt.n) is given in the second
line of Table 5.1. Note that the transpose is implicit – the instruction input is B, but the
multiplication is by BT .
The first version of bmm most closely resembles standard matrix multiplication and
therefore is most useful when performing finite field arithmetic. The transpose version
is most useful when we are taking the dot products of the rows of A and the rows of B.
Given the general correspondence between vector dot product and matrix multiplication
(a • b = a × bT for row vectors a and b), this version naturally follows. The transpose
version can also be used when performing finite field arithmetic if the data happen to be in
the transpose form. The transpose version of bmm is the version implemented by Cray.
In particular, we mention two cases for n, n = 8 and n = 64. The bmm.8 instruction
multiplies two 8 × 8 matrices contained in general purpose registers (see Figure 5.1). The
matrices are laid out such that row i corresponds to byte 8 − i in little-endian order. The
bmm.64 instruction multiplies two 64 × 64 matrices and allows for operations on whole
processor words at once. The Cray instruction is equivalent to bmmt.64.
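A functional model of bmm.8 clarifies the register layout. In this Python sketch (our encoding of the layout described above: row 1 is the most significant byte, column 1 the leftmost bit of its byte; the example operand values are ours), the matrices live in 64-bit integers:

```python
def to_matrix(r):
    """Unpack a 64-bit register into an 8x8 bit matrix; row i is byte
    8-i (row 1 = most significant byte), column j is bit 8-j of it."""
    return [[(r >> (8 * (8 - i) + (8 - j))) & 1 for j in range(1, 9)]
            for i in range(1, 9)]

def from_matrix(m):
    r = 0
    for i in range(1, 9):
        for j in range(1, 9):
            r |= m[i - 1][j - 1] << (8 * (8 - i) + (8 - j))
    return r

def bmm8(ra, rb):
    """C = A x B mod 2 on 8x8 bit matrices packed in registers:
    AND for the bit products, XOR for the mod-2 sums."""
    A, B = to_matrix(ra), to_matrix(rb)
    C = [[0] * 8 for _ in range(8)]
    for i in range(8):
        for j in range(8):
            acc = 0
            for k in range(8):
                acc ^= A[i][k] & B[k][j]
            C[i][j] = acc
    return from_matrix(C)

# The identity matrix has row i = 0x80 >> (i - 1); A x I = A.
I = int.from_bytes(bytes(0x80 >> k for k in range(8)), 'big')
x = 0x0123_4567_89AB_CDEF
assert bmm8(x, I) == x
```

Replacing I with any other permutation matrix permutes the columns of A, which is the "column exchange" use of bmm discussed below.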
Figure 5.1: (a) Layout of 8 × 8 matrix in a register rB; (b) bmm.8 operation (1 bit of output)
5.2 Bit Matrix Multiply Basic Operations
We describe some basic operations that a bmm operation can perform.
Finite Field Multiplication – Bit matrix multiply can be used to perform multiplication
over GF(2k).
Subset Parity – Bit matrix multiply, bmm or bmmt, can be used to calculate the parity
of an arbitrary subset of the bits of a row of A. The bit positions from which the parity is
computed is specified by setting the corresponding bit positions to “1” in a column of B (or
row of B for the transpose version). Usage of bmmt may be easier for parity as the masks
can be directly specified in processor registers.
Transpose – The transpose version, bmmt, can perform bit matrix transpose by setting
A to I, the identity matrix. By Table 5.1 second row, C then equals BT .
Permutations – bmm can be used to perform arbitrary bit-level permutations with rep-
etitions and zeroing. A B matrix with a single “1” in each row and column produces a
permutation of the columns of the A matrix (“column exchange”). A B matrix with a sin-
gle “1” in each column, but more than one “1” in at least one row produces a permutation
with repetition. A matrix with a zero column will zero the corresponding column of the
result. Thus bmm can be used to perform any of the operations described in Chapter 3
(see Figure 5.2 for an example of bmm being used to form pex). bmm can also perform a
standard bit-wise and (with “1”s only appearing on the main diagonal), shift (with “1”s
only appearing on any single non-main diagonal) or rotate (with “1”s appearing on any
single wrapped-around diagonal). If, instead of B, the A matrix is a permutation matrix,
then the rows (or subwords) of B are permuted (“row exchange”).
In general, for an n-bit LFSR, the A matrix is given by:
A = \begin{bmatrix} 0 & \cdots & 0 & p_1 \\ & & & p_2 \\ & I_{n-1} & & \vdots \\ & & & p_n \end{bmatrix}
where the p vector holds the coefficients of the generator polynomial and is used to compute
the feedback, and the (n − 1)-bit identity matrix shifts the rest of the state. Note that to
compute up to k cycles of the LFSR at once, we multiply the state by A^k.
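The companion-matrix formulation can be checked against a step-by-step LFSR. This Python sketch (our small example, using a 4-bit register with hypothetical tap coefficients) builds A as above and verifies that multiplying the state column vector by A^k mod 2 advances the register k cycles:

```python
n = 4
p = [1, 0, 0, 1]     # hypothetical tap coefficients p_1 .. p_n

# Companion matrix: first row (0 ... 0 p_1); below it, I_{n-1} with
# the column (p_2 ... p_n) appended on the right.
A = [[0] * (n - 1) + [p[0]]]
for i in range(1, n):
    row = [0] * (n - 1)
    row[i - 1] = 1
    A.append(row + [p[i]])

def mat_mul(X, Y):
    """GF(2) matrix product: AND for multiply, parity for add."""
    return [[sum(X[i][k] & Y[k][j] for k in range(len(Y))) & 1
             for j in range(len(Y[0]))] for i in range(len(X))]

def mat_pow(M, k):
    R = [[int(i == j) for j in range(len(M))] for i in range(len(M))]
    for _ in range(k):
        R = mat_mul(R, M)
    return R

def lfsr_step(s):
    """One cycle computed directly: feedback from s_n into the taps."""
    fb = s[-1]
    return [p[0] & fb] + [s[i - 1] ^ (p[i] & fb) for i in range(1, n)]

s0, k = [1, 0, 1, 1], 5
direct = s0
for _ in range(k):
    direct = lfsr_step(direct)
via_matrix = [row[0] for row in mat_mul(mat_pow(A, k), [[b] for b in s0])]
assert via_matrix == direct
```

With bmm, the A^k multiply becomes a single instruction once A^k has been precomputed, so k cycles of the LFSR collapse into one operation.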
Error Correction – Error correction coding adds redundancy to the message so that the
symbols can be recovered in the face of channel noise. One basic error correction scheme
is linear block codes, in which parity bits are computed for various subsets of the bits of
a message block. The generation of multiple parity patterns can be obtained by using a
bmm of the message symbols with a generator matrix. In convolutional coding, the input
is processed as a continuous stream. One or more shift registers store some number of
previous input bits and the outputs are the parity of subsets of the stored bits and the next
input bits. Convolutional codes can also be sped up by bmm by expressing the parity generation
in matrix form and then replicating and shifting that matrix to achieve the effect of the shift
registers.
Puncturing – In puncturing, message bits are removed after error correction coding to remove
some redundancy. Removing message bits is accomplished by a pex instruction. Depunc-
turing is performed at the decoder to reinsert the bits and uses a pdep instruction.
6.2 Performance Results
We coded kernels for the above applications and simulated them using the SimpleScalar
Alpha simulator [100], enhanced to recognize our new instructions by replacing unused op-
codes with our new instruction definitions. The simulator parameters are given in Table 6.2.
Figure 6.7 shows speedups for pex and pdep, and Figures 6.8 and 6.9 show results for
bmm. In each case, the results are normalized with respect to the baseline Alpha ISA in-
struction counts.
For pex and pdep we tested bit compression of a long stream of bytes; LSB steganog-
raphy coding replacing the low 4 bits of 16-bit PCM encoded audio; uuencoding using a
long stream of bytes; integer compression using a long stream of integers; binary image
morphology using a 512× 512 bit image; BLASTX translate using a long stream of DNA
bases; and random number generation using a long stream of bits from the entropy pool.
The baseline implementation used the algorithms from Hacker’s Delight [16].
Overall, the speedups over baseline range from 1.07× to 9.9×, with an average of
2.19× (1.84× for static pex and 1.93× for static pdep). The benchmark that exhibited
the greatest speedup was random number generation as performing pex.v in software is
much more expensive than performing static pex in software. Of the other benchmarks,
integer compression was sped up the least. This is due to the extra bookkeeping required
to track when compressed integers cross word boundaries. The speedup is also lower for
binary image morphology, due to extra work in extracting the 3 × 3 pixel neighborhoods
that cross register boundaries, and for steganography, as there are only 4 fields per word,
limiting the effect of the new instructions.

Table 6.2: SimpleScalar Parameters

    instruction fetch queue    32
    decode width               4
    issue width                4
    RUU size                   128
    load/store queue           32
    memory ports               2
    ALUs                       4
    multipliers                2
    pex/pdep units             1
    bmm units                  2*
    ptlu units                 1
    static pex/pdep latency    1
    pex.v/pdep.v latency       3
    bmm latency                1
    ptrd latency               2
    L1 I-cache                 32KB
    L1 D-cache                 32KB
    L2 cache                   500KB

    * Only one set of extra registers, but two AND-XOR trees
We also tested the BLASTZ program using the simulator. The results were highly
dependent on the inputs chosen. For shorter sequences, in the range of 104 bases, speedups
up to 1-2% were observed. For longer sequences, 105 bases, only a minor speedup was
observed – around 0.1%. Clearly, BLASTZ can benefit from pex, but this benefit can be
very minor in some cases.
For bmm, we first ran a set of artificial benchmarks. We tested a generic (N × 64) ×
(64 × 64) → (N × 64) matrix multiply for N = 64, 1024 and 65536 and multiplication
of two 1024 × 1024 matrices (one case where the B matrix is known in advance and one
where it is not). The baseline algorithm is serial table lookup. The results are shown in
Figure 6.8.
Figure 6.7: Speedup of applications using pex and pdep
Figure 6.8: Speedup of artificial benchmarks using bmm
The results for the (N × 64)× (64× 64)→ (N × 64) basic test case are similar to the
results of Table 5.7 only for N = 1024. For N = 64, the overhead of loading the B matrix
for bmm.64 and loading the tables for PTLU adversely affects performance, as indicated
in Table 5.8, while the overhead is negligible for the other cases. For N = 65536, cache
misses become significant. bmm.64 and PTLU are the most affected because they require
the fewest computation operations, thus making them most sensitive to memory latency.
For matrix multiplication using precomputed matrices (indicated by a * in Figure 6.8),
again we see the effect of the overhead for bmm.64 and PTLU. PTLU is also negatively
impacted by cache misses due to the need to store all the tables (4MB of tables). For matrix
multiplication using a new B matrix (indicated by a ** in Figure 6.8), the baseline table
lookup case becomes worse due to the need to compute the tables. This overhead is greater
than that of resizing the matrices. Consequently, bmm.8, bmm.16.ac and bmm.64 are
better relative to the baseline.
We then evaluated the use of bmm in the applications. We tested the linear cryptanalysis
algorithm of Figure 2.19 (inner loop only); random number generation using a 768 × 256
Toeplitz matrix; block coding of a long stream of bits using the Golay code; convolutional
coding using the encoder of Figure 2.23; and transpose of a 1024×1024 matrix. The results
are shown in Figure 6.9.
For linear cryptanalysis, the results are affected by the extra work required to maintain
the counts for each mask. This limits the maximum speedup attainable (à la Amdahl's
Law). As bmmt.64 speeds up the parity computation portion the most, it most clearly
exhibits this limitation, even though N is large enough in this case to amortize the costs
of changing B. Additionally, the transpose version of bit matrix multiply is used, as we
assume the masks, plaintexts and ciphertexts are all in rows. This impacts PTLU, as the
transpose of the matrices is computed in software.
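The parity computation that bmmt.64 accelerates is, in scalar form, the parity of a masked word, evaluated once per mask per text pair. The following sketch of the scalar inner loop is our own illustration (names and loop structure are ours, not the benchmark code):

```c
#include <stdint.h>

/* Parity of (x AND mask): the core operation of the linear
   cryptanalysis inner loop. Returns 0 or 1. */
static int masked_parity(uint64_t x, uint64_t mask) {
    return __builtin_parityll(x & mask);
}

/* Scalar inner loop for one mask pair: bump the counter when the
   input and output parities agree for a (plaintext, ciphertext) pair. */
static void tally(const uint64_t *pt, const uint64_t *ct, int n,
                  uint64_t in_mask, uint64_t out_mask, long *count) {
    for (int i = 0; i < n; i++)
        if (masked_parity(pt[i], in_mask) == masked_parity(ct[i], out_mask))
            (*count)++;
}
```

A bit matrix multiply computes many such mask parities in one instruction, which is why the remaining per-mask bookkeeping becomes the bottleneck.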
Overhead of changing matrices also impacts the results for the random number gen-
eration benchmark. As N = 768, bmm.64 has worse performance as compared to the
Figure 6.9: Speedup of applications using bmm
1024 × 64 multiplication or the 1024 × 1024 multiplication. For PTLU, performance is
worse than the 1024 × 64 case, due to the overhead, but better than 1024 × 1024 case as
there are fewer tables and no cache misses.
For Golay encoding, bmm.64 and PTLU slow down, not due to the overhead of chang-
ing the B matrix or tables, but rather due to the fact that the input stream must be rearranged
to be multiplied by the fixed 24× 48 generator matrix (the 12× 24 generator matrix is tiled
twice). bmm.8 gets faster in this case as the generator matrix decomposes well into the 8×8
submatrices. bmm.16.ac is slower than bmm.8, due to the sparseness of the generator
matrix (expressed as a 48 × 96 matrix to better fit with the 16 × 16 submatrices). There
are 10 multiplies for both bmm.8 and bmm.16.ac and it is not possible to fully amortize
the cost of loading the B submatrix for each bmm.16.ac.
Compared to the Golay encoder, the results for the convolutional encoder are better
for all instructions. This is due to the structure of the B matrix – there are fewer multiply
instructions required, but more table lookups, as the table lookup method requires an entire
set of 8 rows to be zero to avoid a lookup, while multiplication requires a small 8 × 8
or 16 × 16 block to be zero to skip a submatrix multiply. The structure of the matrix for
convolutional encoding decomposes better for bmm.16.ac, so it is faster than bmm.8.
Table 6.3: Incremental performance gain and area increase for submatrix multiply

                           Incremental Increase   Incremental Increase in Area
                           in Performance         (versus 2× bmm.8)
  bmm.16.ac vs. bmm.8              1.4                      2.2
  bmm.64 vs. bmm.8                 3.2                     12.3
  bmm.64 vs. bmm.16.ac             2.3                      5.7
Additionally, the Golay encoder is a rate 1/2 encoder while the convolutional coder is a
rate 2/3. Thus, as there are fewer outputs per input for the convolutional encoder, more
input rows can be processed at a time without encountering register pressure, allowing for
greater amortization of loading the B matrix.
For transposition, bmmt.64 suffers as the B matrix is loaded every time, while few mul-
tiplications are required for bmmt.8 and bmmt.16.ac as A (= I) is very sparse.
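What bmmt with A = I computes is simply the transpose of B. A bit-by-bit reference version in C (our own sketch, far slower than the single-pass bmmt.64 form, but it makes the bit movement explicit):

```c
#include <stdint.h>

/* Reference 64x64 bit-matrix transpose: bit j of row i of the input
   becomes bit i of row j of the output. This is the effect of bmmt
   with A = I; the software loop touches every bit individually. */
static void bit_transpose64(const uint64_t in[64], uint64_t out[64]) {
    for (int j = 0; j < 64; j++) out[j] = 0;
    for (int i = 0; i < 64; i++)
        for (int j = 0; j < 64; j++)
            out[j] |= ((in[i] >> j) & 1ULL) << i;
}
```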
Overall, bmm.8 is 1.6× faster than the baseline case, bmm.16.ac is 2.2× faster,
bmm.64 is 5.0× faster and PTLU is 3.6× faster. However, we see that bmm.8 and
bmm.16.ac are mostly insensitive to extremes or changes in parameters. bmm.64 and
PTLU are, on the other hand, highly sensitive due to the need to amortize the cost of loading the
B matrix or tables. Furthermore, the cost of computing tables is significant. We also see
sensitivity to awkward matrix sizes and sparse matrices. bmm.8 is most able to handle the
former, as long as the matrix can be decomposed into 8 × 8 submatrices, and the latter, as
few submatrices are required.
In Table 6.3, we consider the incremental performance gain and area increase going
from bmm.8 to bmm.16.ac to bmm.64 (for units that support two multiplies). Going
from bmm.8 to bmm.16.ac to bmm.64 does provide increasing performance, but at a
steeper and steeper cost. Clearly, bmm.64 has the best overall performance, but it also
has the greatest cost. The bmm.8 instruction provides reasonable performance gains, is
insensitive to parameter changes, can take advantage of sparsity, requires no extra state and
can, by itself, perform many of the existing SIMD ISA subword manipulation instructions.
6.3 Summary
In this chapter, we analyzed applications that benefit from the newly proposed instruc-
tions. These applications include bit compression and decompression, least significant bit
steganography, binary image morphology, transfer coding, bioinformatics, integer com-
pression, random number generation, error correcting coding, matrix transposition and
cryptology.
Our results show that the applications that benefit from pex and pdep were sped up
2.19× on average. However, only one of our benchmarks made use of loop-invariant pex
and pdep, one made use of the variable pex.v instruction and none used pdep.v.
Consequently, it may not be necessary to provide hardware support for the loop-invariant
and variable operations, especially as the hardware decoder has high cost.
The benchmarks for bit matrix multiply show that bmm.8 is 1.6× faster than the base-
line case, bmm.16.ac is 2.2× faster, bmm.64 is 5.0× faster and PTLU is 3.6× faster,
on average. bmm.64 and PTLU have the best overall performance, but they also have the
greatest cost.
Chapter 7
Conclusions and Future Research
This thesis presented a number of contributions concerning the design, implementation and
benefits of advanced bit manipulation instructions for commodity processors. In this chap-
ter, we summarize the work done in the thesis and propose directions for future research.
7.1 Advanced Bit Manipulation Instructions
The first contribution of the thesis is the novel parallel extract and parallel deposit instruc-
tions. In our analysis of bit-oriented instructions, we saw that specialized bit permutation
instructions were lacking. We proposed the parallel extract instruction to perform bit gather
operations and the parallel deposit instruction to perform bit scatter operations. We showed
how these instructions utilize the butterfly and inverse butterfly datapaths and developed an
algorithm for decoding the bitmask that specifies the bit gather or scatter operations into
the controls for the datapaths. We designed a hardware circuit, composed of a parallel pre-
fix population count subcircuit and a set of left rotate and complement subcircuits, which
implements our decoding algorithm.
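For reference, the bit gather and bit scatter semantics that this decoder supports can be modeled by a bit-serial software loop. The sketch below captures the functional behavior only, not the butterfly-network implementation, and the function names are ours:

```c
#include <stdint.h>

/* Bit-serial reference for parallel extract (bit gather): collect the
   data bits of r2 selected by the mask in r3, right-justified. */
static uint64_t pex_ref(uint64_t r2, uint64_t r3) {
    uint64_t res = 0, bit = 1;
    for (; r3 != 0; r3 &= r3 - 1) {   /* walk set mask bits, LSB first */
        uint64_t m = r3 & -r3;        /* lowest set bit of the mask */
        if (r2 & m) res |= bit;
        bit <<= 1;
    }
    return res;
}

/* Bit-serial reference for parallel deposit (bit scatter): scatter the
   right-justified bits of r2 to the positions selected by the mask in r3. */
static uint64_t pdep_ref(uint64_t r2, uint64_t r3) {
    uint64_t res = 0;
    for (; r3 != 0; r3 &= r3 - 1, r2 >>= 1) {
        uint64_t m = r3 & -r3;
        if (r2 & 1) res |= m;
    }
    return res;
}
```

Note that pdep with a given mask inverts pex with the same mask on the masked bits, which is why the two instructions share one decoder.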
7.2 Advanced Bit Manipulation Functional Unit
We designed a standalone advanced bit manipulation functional unit that implements paral-
lel extract, parallel deposit and the butterfly and inverse butterfly permutation instructions.
Due to the high cost of the hardware decoder, we considered the design of two alternative units
– one that supports only static pex and pdep operations, for which the datapath control bits
are precomputed, and one that supports the variable pex.v and pdep.v instructions as well
as loop invariant operations, for which the hardware decoder is used once and the control
bits are then captured and reused. The unit that supports only the static instructions is faster
(0.95× the latency) and smaller (0.9× the area) than an ALU.
7.3 Advanced Bit Manipulation Functional Unit as Basis
for the Shifter
We proposed that the advanced bit manipulation functional unit, rather than being stan-
dalone, should replace the standard shifter. We showed how rotation operations are per-
formed on butterfly and inverse butterfly datapaths and how instructions such as shift, ex-
tract, deposit and mix are just minor variations on rotation. We designed a rotation control
bit generator circuit that produces the datapath control bits from the shift amount.
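The sense in which shift and extract are minor variations on rotation can be sketched in C: each is a rotation followed by masking off the wrapped-around bits. This is a functional model of the relationship, not the control-bit generator circuit itself; the helper names are ours:

```c
#include <stdint.h>

/* Rotate right by n (0 <= n <= 63). */
static uint64_t rotr64(uint64_t x, unsigned n) {
    n &= 63;
    return n ? (x >> n) | (x << (64 - n)) : x;
}

/* Logical shift right = rotate right, then clear the n wrapped-in high bits. */
static uint64_t shr64(uint64_t x, unsigned n) {
    return rotr64(x, n) & (~0ULL >> n);   /* keep the 64-n low bits */
}

/* Unsigned extract: rotate the field down to bit 0, then mask to len bits. */
static uint64_t extr64(uint64_t x, unsigned pos, unsigned len) {
    uint64_t mask = (len >= 64) ? ~0ULL : (1ULL << len) - 1;
    return rotr64(x, pos) & mask;
}
```

In the proposed shifter, the rotation is performed on the butterfly or inverse butterfly datapath and the masking is folded into the final stage, so each of these variants costs essentially one pass through the network.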
This new shifter architecture represents an evolution of shifter design from the classic
barrel and log shifters. The new shifter supporting static pex and pdep is 11% slower
and 1.8× larger than the log shifter, while being able to perform a much richer array of
operations. The shifter supporting variable pex.v and pdep.v is 2.1× slower and 2.4×
larger, but supports the widest range of operations.
7.4 Implementation of Bit Matrix Multiply in Commodity
Processors
Another contribution of the thesis consists of architectural proposals to accelerate the bit
matrix multiply operation. Bit matrix multiplication is used to speed up parity computation
and can perform a superset of the bit manipulation operations. This instruction is supported
by the Cray supercomputer and we considered alternatives suitable for a commodity pro-
cessor. One proposal is submatrix multiplication, in which the multiplicand matrix B is
stored in extra registers in the functional unit. We also considered using the Parallel Ta-
ble Lookup module, which allows for looking up multiple tables in parallel, for bit matrix
multiply, as the best current technique for bit matrix multiply is the table lookup method.
For submatrix multiplication, we investigated the choice of sizes for multiplier and
multiplicand matrices and determined that bmm.8, which multiplies two 8×8 matrices (in
general purpose registers), and bmm.16.ac, which multiplies a 4× 16 matrix in a general
purpose register by a 16×16 matrix in the extra storage and then accumulates with a 4×16
matrix in a general purpose register, were the best choices. The bmm.8 unit is 30% the
size of an ALU, and thus one may consider implementing bmm.8 functionality within the
ALU. The bmm.8 unit also requires no additional architectural register state and thus does
not need operating system support.
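Assuming one row per byte with row 0 in the least significant byte (our packing convention for illustration; the actual bmm.8 layout may differ), the effect of multiplying two 8×8 bit matrices held in general purpose registers can be modeled as:

```c
#include <stdint.h>

/* 8x8 bit-matrix multiply over GF(2): for each row of A, XOR together
   the rows of B selected by that row's set bits. Both operands and the
   result are packed one row per byte of a 64-bit word. */
static uint64_t bmm8(uint64_t a, uint64_t b) {
    uint64_t c = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t arow = (a >> (8*i)) & 0xff, crow = 0;
        for (int j = 0; j < 8; j++)
            if (arow & (1u << j))
                crow ^= (b >> (8*j)) & 0xff;
        c |= (uint64_t)crow << (8*i);
    }
    return c;
}
```

With this packing, the 8×8 identity matrix is the word 0x8040201008040201, and multiplying by it on either side leaves the operand unchanged.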
7.5 Application Analysis
The final contribution of this thesis is the analysis of applications that benefit from bit
manipulation operations. We showed how diverse applications such as bit compression,
image manipulation, communication coding, random number generation, bioinformatics,
integer compression and cryptology are all sped up by our new instructions. Overall, the
applications that benefited from pex and pdep were sped up 2.19× on average. Of the
applications considered, only one uses pex.v. This, together with the high cost of the pex.v
functional unit, indicates that the preferred functional unit is one that supports only static
pex and pdep.
The applications that benefit from bit matrix multiply were sped up 1.6× using bmm.8,
2.2× using bmm.16.ac, 5.0× using bmm.64, and 3.6× using PTLU, on average. Over-
all, bmm.64 was 3.2× faster than bmm.8 and bmm.16.ac was 1.4× faster. However,
the bmm.16.ac unit is 2.2× the size of bmm.8 and the bmm.64 unit is 12.3× the size.
Clearly, going from bmm.8 to bmm.16.ac to bmm.64 does provide increasing perfor-
mance, but at a steeper and steeper cost. To achieve the best performance, a full Cray-like
bmm.64 unit can be implemented. The PTLU unit has good performance, but is very
costly. However, it can be used for many purposes aside from bit matrix multiply.
7.6 Future Research
There are a number of possible directions for future work. The first possibility is compiler
support for bit manipulation instructions. In coding our simulations, we used compiler
intrinsics, essentially inline assembly with better syntax, to access our instructions. This
approach helps the programmer who is diligent enough to discover that the compiler sup-
ports the intrinsics. Ideally, the compiler will recognize the code sequence that is equivalent
to the instructions and automatically generate code that uses the instructions. This is likely
a difficult task, as there are many ways that these operations can be coded in software.
Other future work includes refinement of the implementation of the advanced bit manip-
ulation instructions. There may be better designs for the parallel prefix population counter
or, in fact, better ways entirely for implementing pex and pdep. Additionally, the designs
were synthesized primarily focused on minimizing latency. Other designs that consider
area and power may prove interesting to examine.
Application analysis is also likely a fruitful area for further work. We do not claim to
have performed an exhaustive listing of all the applications that can benefit from bit manip-
ulation instructions. There are likely other algorithms that can be sped up by our instruc-
tions. Furthermore, designing algorithms that specifically take into account the existence of
advanced bit manipulation instructions is an area that has seen little research. Conversely,
application analysis might yield other candidates for new specialized advanced permutation
instructions.
Bit matrix multiply in a multicore setting is another avenue for future research. This
area includes considering how best to decompose large matrix multiplies for parallel pro-
cessing. It is assumed that prior work in decomposing standard matrix multiplication has
relevance, but the unique overheads involved with setting up the B matrices need to be
considered. Another point to evaluate is the tradeoff between having many cores each con-
taining a small bit matrix multiply unit (such as bmm.8), versus having only some cores
contain a larger unit (such as bmm.64).
Overall, we have brought the acceleration of advanced bit manipulation operations out
of the realm of “programming tricks.” We have shown that these operations are needed by
many applications and that direct support for them can be implemented in a commodity