ADVANCED BIT MANIPULATION INSTRUCTIONS: ARCHITECTURE, IMPLEMENTATION AND APPLICATIONS

Yedidya Hilewitz

A dissertation presented to the faculty of Princeton University in candidacy for the degree of Doctor of Philosophy. Recommended for acceptance by the Department of Electrical Engineering. Adviser: Ruby B. Lee. September 2008.
56]. It conserves some of the most useful properties of grp, while being easier to implement: the bits that would have been gathered to the left are instead zeroed out.
3.1.2 Bit Scatter or Parallel Deposit (pdep)
Bit scatter takes the right-aligned, contiguous bits in a register and scatters them in the
result register according to a mask in a second input register. This is the reverse operation
to bit gather. We also call bit scatter a parallel deposit instruction, or pdep, because it
is like a parallel version of the deposit (dep) instruction. Figure 3.2 compares dep and
pdep. The deposit (dep) instruction takes a right justified field of bits from the source
register and deposits it at any single position in the destination register. The parallel deposit
(pdep)instruction takes a right justified field of bits from the source register and deposits
the bits in different non-contiguous positions indicated by a bit mask.
Figure 3.2: (a) dep r1 = r2, pos, len (b) pdep r1 = r2, r3
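As an illustrative sketch of these semantics (a Python behavioral model with hypothetical function names, not part of the hardware design), registers can be treated as bit strings, mirroring the lettered examples of Figure 3.2:

```python
def dep(rd: str, r2: str, pos: int, length: int) -> str:
    """Deposit: place the `length` rightmost bits of r2 into rd as one
    contiguous field whose rightmost bit lands at bit position `pos`
    (bit positions counted from the right)."""
    n = len(rd)
    field = r2[n - length:]
    left = n - pos - length            # characters to the left of the field
    return rd[:left] + field + rd[left + length:]

def pdep(r2: str, mask: str) -> str:
    """Parallel deposit: scatter the rightmost k bits of r2 (k = number of
    '1's in mask) to the mask's '1' positions, in order; other positions 0."""
    k = mask.count("1")
    selected = iter(r2[len(r2) - k:])  # the k rightmost data bits, left to right
    return "".join(next(selected) if m == "1" else "0" for m in mask)

def pex(r2: str, mask: str) -> str:
    """Parallel extract (bit gather): gather the bits selected by the mask
    '1's and right-justify them, zero-filling on the left."""
    picked = "".join(d for d, m in zip(r2, mask) if m == "1")
    return picked.rjust(len(r2), "0")
```

For the running example, pdep("abcdefgh", "10101101") yields "d0e0fg0h", and pex is its inverse on the deposited bits.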
Figure 3.3: Labeling of butterfly network.
3.2 Datapaths for Parallel Extract and Parallel Deposit
It is not intuitively clear that pex and pdep are easy to implement, especially in a single
cycle. In this section, we show that parallel deposit can be mapped to the butterfly network
and that parallel extract can be mapped to the inverse butterfly network. This provides the
basis for a single functional unit that performs bit gather, bit scatter and bit permutation
operations using both butterfly and inverse butterfly datapaths.
3.2.1 Parallel Deposit on the Butterfly Network
We first show an example parallel deposit operation on the butterfly network. Then, we
show that any parallel deposit operation can be performed using a butterfly network.
Figure 3.3 shows our labeling of the left (L) and right (R) halves of successive stages
of a butterfly network. Since we often appeal to induction for successive stages, we usually
omit the subscripts for these left and right halves.
Figure 3.4(a) shows an example pdep operation with mask = 10101101. Figure 3.4(b)
Figure 3.4: (a) 8-bit pdep operation (b) mapped onto butterfly network with explicit right rotations of data bits between stages and (c) without explicit rotations of data bits by modifying the control bits.
shows this pdep operation broken down into steps on the butterfly network. In the first
stage, we transfer from the right (R) to the left half (L) the bits whose destination is in L,
namely bits d and e. Prior to stage 2, we right rotate e00d by 3, the number of bits that
stayed in R, to right justify the bits in their original order, 00de, in the L half. Note that
bits that stay in the right half, R, are already right-justified.
In each half of stage 2 we transfer from the local R to the local L the bits whose final
destination is in the local L. So in R, we transfer g to RL and in L we transfer d to LL. Prior
to stage 3, we right rotate the bits to right justify them in their original order in their new
subnetworks. So d0 is right rotated by 1, the number of bits that stayed in LR, to yield 0d
and gf is right rotated by 1, the number of bits that stayed in RR, to yield fg.
In each subnetwork of stage 3 we again transfer from the local R to the local L the bits
whose final destination is in the local L. So in LL we transfer d and in LR we transfer e.
After stage 3 we have transferred each bit to its correct final destination: d0e0fg0h. Note
that we use a control bit of “0” to indicate a pass through operation and a control bit of “1”
to indicate a swap.
Rather than explicitly right rotating the data bits in the L half after each stage, we can
compensate by modifying the control bits. This is shown in Figure 3.4(c). How the control
bits are derived will be explained later in Section 3.2.4.
We now show that any pdep operation can be mapped to the butterfly network.
Fact 3.1. Any single data bit can be moved to any result position by just moving it
to the correct half of the intermediate result at every stage of the butterfly network.
This can be proved by induction on the number of stages. At stage 1, the data bit
is moved within n/2 positions of its final position. At stage 2, it is moved within n/4
positions of its final result, and so on. At stage lg(n), it is moved within n/2^lg(n) = 1
position of its final result, which is its final result position. Referring back to Figure 3.4(b),
we utilized Fact 3.1 to decide which bits to keep in R and which to transfer from R to L at
each stage.
Fact 3.2. If the mask has k “1”s in it, the k rightmost data bits are selected and
moved, i.e., the selected data bits are contiguous. They never cross each other in
the final result.
This fact is by definition of the pdep instruction. See the example of Figure 3.4(a)
where there are 5 “1”s in the mask and the selected data bits are the 5 rightmost bits,
defgh; these bits are spread out to the left maintaining their original order, and thus never
crossing each other in the final result.
Fact 3.3. If a data bit in the right half (R) is swapped with its paired bit in the left
half (L), then all selected data bits to the left of it will also be swapped to L (if they
are in R) or stay in L (if they are in L).
Since the selected data bits never cross each other in the final result (Fact 3.2), once a
bit swaps to L, the selected bits to the left of it must also go to L. Hence, if there is one
“1” in the mask, the one selected data bit, d0, can go to R or L. If there are two “1”s in the
mask, the two selected data bits, d1d0, can go to RR or LR or LL. That is, if the data bit on
the right stays in R, then the next data bit can go to R or L, but if the data bit on the right
goes to L, the next data bit must also go to L. If there are three “1”s, the three selected data
bits, d2d1d0, can go to RRR, LRR, LLR or LLL. For example, in Figure 3.4(b) stage 1, the
five bits have the pattern LLRRR as e is transferred to L and d must then stay in L.
Fact 3.4. The selected data bits that have been swapped from R to L, or stayed in
L, are all contiguous mod n/2 in L.
From Fact 3.3, the destinations of the k selected data bits dk−1 . . .d0 must be of the form
L. . .LR. . .R, a string of zero or more L’s followed by zero or more R’s (see Figure 3.5).
Define X as the bits staying in R, Y as the bits going to L that start in R and Z as the bits
going to L that start in L. It is possible that:
i. X alone exists – when there are no selected data bits that go to L,
ii. Y alone exists – when there are no selected data bits that start in L and all bits that start
in R go to L,
iii. X and Y exist – when there are no selected data bits that start in L and some bits that
start in R stay in R and some go to L,
iv. X and Z exist – when all the bits in R are going to R, and all bits going to L start in L,
or
v. X , Y and Z exist.
When X alone exists (i), there are no bits that go to L, so Fact 3.4 is irrelevant.
The structure of the butterfly network requires that when bits are moved in a stage, they
all move by the same amount. Fact 3.2 states that the selected data bits are contiguous.
Together these imply that when Y alone exists or X and Y exist (ii and iii), Y is moved as
a contiguous block from R to L and Fact 3.4 is trivially true.
When X and Z exist (iv), Z is a contiguous block of bits that does not move so again
Fact 3.4 is trivially true.
When X , Y and Z exist (v), Y comprises the leftmost bits of R, and Z the rightmost
bits in L since they are contiguous across the midpoint of the stage (Fact 3.2). When Y is
swapped to L, since the butterfly network moves the bits by an amount equal to the size of L
or R in a given stage, Y becomes the leftmost bits of L. Thus Y and Z are now contiguous
mod n/2, i.e., wrapped around, in L (Figure 3.5).
Thus Fact 3.4 is true in all cases.
For example, in Figure 3.4(b) at the input to stage 1, X is bits fgh, Y is bit e and Z
is bit d. Y is the leftmost bit in R and Z is the rightmost bit in L. After stage 1, Y is the
leftmost bit in L and is contiguous with Z mod 4, within L, i.e., de is contiguous mod 4 in
e00d.
Figure 3.5: At the output of a butterfly stage, Y and Z are contiguous mod n/2 in L and can be rotated to be the rightmost bits of L.
Fact 3.5. The selected data bits in L can be rotated so that they are the rightmost
bits of L, and in their original order.
From Fact 3.4, the selected data bits are contiguous mod n/2 in L. At the output of
stage 1 in Figure 3.5, these bits are offset to the left by the size of X (the number of bits
that stayed in R), denoted by |X|. Thus if we explicitly rotate right the bits by |X|, the
selected data bits in L are now the rightmost bits of L in their original order (Figure 3.5).
In Figure 3.4(b), Fact 3.5 was utilized prior to stages 2 and 3.
At the end of this step, we have two half-sized butterfly networks, L and R, with the
selected data bits right-aligned and in order in each of L and R (last row of Figure 3.5). The
above can now be repeated recursively for the half-sized butterfly networks, L and R, until
each L and R is a single bit. This is achieved after lg(n) stages of the butterfly network (see
Figure 3.4(b)).
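The recursive construction of Facts 3.1 through 3.5 can be sketched as a Python behavioral model (our own illustration; `route` is a hypothetical helper name). It performs the stage swaps and the explicit right rotations of the left half, then recurses on the half-sized subnetworks:

```python
def route(bits: list, mask: str) -> list:
    """Route a pdep on a butterfly network with explicit right rotations of
    the left half between stages (Facts 3.1-3.5). `bits` holds the selected
    data bits right-aligned, with '0' in all other positions."""
    n = len(mask)
    if n == 1:
        return bits
    h = n // 2
    keep = mask[h:].count("1")       # |X|: selected bits that stay in R
    out = bits[:]
    # Stage: the (h - keep) leftmost pairs swap ('1'), the keep rightmost pass.
    for p in range(h - keep):
        out[p], out[h + p] = out[h + p], out[p]
    # Fact 3.5: right rotate L by |X| so its selected bits are right-justified.
    left = out[:h]
    k = keep % h
    left = left[h - k:] + left[:h - k]
    # Recurse on the half-sized L and R subnetworks with the mask halves.
    return route(left, mask[:h]) + route(out[h:], mask[h:])
```

Tracing the mask 10101101 of Figure 3.4(b) reproduces the intermediate values e00d, 00de and the final result d0e0fg0h described in the text.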
The selected data bits emerge from stage 1 in Figure 3.5 rotated to the left by |X|. In
Fact 3.5, the selected data bits are explicitly rotated back to the right by |X|. Instead, we
can compensate for the rotation by modifying the control bits of the subsequent stages to
limit the rotation to be within each subnetwork. For example, if the n-bit input to stage 1 is
rotated by k positions, the two n/2-bit inputs to the L and R subnetworks are rotated by k
(mod n/2) within each subnetwork. At the output of stage lg(n), the subnetworks are 1-bit
wide so the rotations are absorbed.
Fact 3.6. If the data bits are rotated by k positions left (or right) at the input to a
stage of a butterfly network, then at the output of that stage we can obtain a rotation
left (or right) by k positions of each half of the output bits by rotating left (or right)
the control bits by k positions and complementing them upon wrap around.
Consider again the example of Figure 3.4(b). The selected data bits emerge from stage
1 left rotated by 3 bits, i.e., L is e00d, left rotated by 3 positions from 00de. In Fig-
ure 3.4(b), we explicitly rotated the data bits back to the right by 3. Instead, we can
compensate for this left rotation by left rotating and complementing upon wrap around
by 3 positions the control bits of the subsequent stages. This is shown in Figure 3.6.
For stage 2, the control bit pattern of L, after left rotate and complement by 3, becomes
10 → 00 → 01 → 11. The rotation by 3 is limited to a rotation by 3 mod 2 = 1 within
each half of the output of L of stage 2 as the output is transformed from d0,0e in Fig-
ure 3.4(b) to 0d,e0 in Figure 3.6. For stage 3, the rotation and complement by 3 of the two
single control bits in L become three successive complements: 1 → 0 → 1 → 0, and the
left rotation of L is absorbed as the overall output is still d0e0. Hence, Figure 3.6 shows
how the control bits in stages 2 and 3 compensate for the left rotate by 3 bits at the output
of stage 1 (cf. Figure 3.4(b).)
Figure 3.4(c) shows the control bits after also compensating for the left rotate by 1 bit
of RL and LL, prior to stage 3 in Figure 3.6. The explicit right rotation prior to stage 3 is
eliminated. Instead, the two control bits in RL and LL transform from 0 → 1 to absorb the
rotation. Hence, the overall result in Figure 3.4(c) remains the same as in Figure 3.4(b).
The explicit data rotations in Figure 3.4(b) are replaced with rotations of the control bits
instead, complementing them on wraparound.
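Fact 3.6 can also be checked mechanically. The sketch below (our own verification code, with hypothetical helper names) models one butterfly stage of width n with pair distance n/2, and confirms that rotating the input and LROTC-ing the control bits yields each half of the output rotated:

```python
def lrotc(bits: str, rot: int) -> str:
    """Left rotate and complement on wrap: each wrapped bit is flipped."""
    for _ in range(rot):
        bits = bits[1:] + ("0" if bits[0] == "1" else "1")
    return bits

def stage(data: str, ctrl: str) -> str:
    """One n-bit butterfly stage: control bit p swaps data[p] and data[p+n/2]."""
    h = len(data) // 2
    out = list(data)
    for p, c in enumerate(ctrl):
        if c == "1":
            out[p], out[h + p] = out[h + p], out[p]
    return "".join(out)

def rotl(s: str, k: int) -> str:
    k %= len(s)
    return s[k:] + s[:k]

def halves_rotl(s: str, k: int) -> str:
    """Rotate each half of s left by k, the outcome Fact 3.6 promises."""
    h = len(s) // 2
    return rotl(s[:h], k) + rotl(s[h:], k)
```

Fact 3.6 then states that stage(rotl(x, k), lrotc(c, k)) equals halves_rotl(stage(x, c), k), which can be verified exhaustively for an 8-bit stage.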
We now explain why the control bits are complemented when they wrap around. The
Figure 3.6: The elimination of explicit right rotations after stage 1 in Figure 3.4(b).
goal is to keep the data bits in the half they were originally routed to at each stage of the
butterfly network, in spite of the rotation of the input.
Figure 3.7(a) shows a pair of bits, a and b, that were originally passed through. So
we wish to route a to L and b to R in spite of any rotation. As the bits are rotated (Fig-
ure 3.7(b)), the control bit is rotated with them, keeping a in L and b in R, as desired.
When the bits wrap around, (Figure 3.7(c)), a wraps to R and b crosses the midpoint to
L. If the control bit is simply rotated and wrapped around with the paired bits, then a is
passed through to R and b is passed through to L, which is contrary to the originally desired
behavior. If instead the control bit is complemented when it wraps around (Figure 3.7(d)),
then a is swapped back to L and b is swapped back to R, as is desired.
Similarly, if a and b were originally swapped (Figure 3.8(a)), a should be routed to R
and b to L. As the bits rotate (Figure 3.8(b)), we simply rotate the control bit with them.
When the bits wrap around (Figure 3.8(c)), input a wraps to R and b crosses to L. When
they are swapped, a is routed to L and b to R, contrary to their original destinations. If
Figure 3.7: (a) A pair of data bits initially passed through; (b) rotation of the paired data bits and control bit; (c) wrapping of the data bits and control bit; (d) complementation of the control bit.
the control bit is complemented on wraparound, a is passed through to R and b is passed
through to L, conforming to the originally desired behavior.
Thus complementing the control bit when it wraps and changing the behavior of the
paired bits from pass through to swap or vice versa causes each of the pair of bits to stay in
the half to which it was originally routed despite the rotation of the input pushing each bit
to the other half. This limits the rotation of the input to be within each half of the output
and not across the entire output.
We now give a theorem to formalize the overall result:
Theorem 3.7. Any parallel deposit instruction on n bits can be implemented with
one pass through a butterfly network of lg(n) stages without path conflicts (with the
bits that are not selected zeroed out externally).
Proof. We give a proof by construction. Assume there are k “1”s in the right half of the bit mask. Then, based on Fact 3.1, the k rightmost data bits (block X in Figure 3.5) must be kept
in the right half (R) of the butterfly network and the remaining contiguous selected data bits
must be swapped (block Y ) or passed through (block Z) to the left half (L). This can be
Figure 3.8: (a) A pair of data bits initially swapped; (b) rotation of the paired data bits and control bit; (c) wrapping of the data bits and control bit; (d) complementation of the control bit.
accomplished in stage 1 of the butterfly network by setting the k rightmost configuration
bits to “0” (to pass through X) and the remaining configuration bits to “1” (to swap Y ).
At this point, the selected data bits in the right subnetwork (R) are right-aligned but
those in the left subnetwork (L) are contiguous mod n/2, but not right aligned (see Fact 3.4
and Figure 3.5) – they are rotated left by the size of block X or the number of bits kept in
R. We can compensate for the left rotation of the bits in L and determine the control bits
for subsequent stages as if the bits in L were right aligned. This is accomplished by left
rotating and complementing upon wraparound the control bits in the subsequent stages of
L by the number of bits kept in R (once these control bits are determined pretending that
the data bits in L are right aligned). Modifying the control bits in this manner will limit
the rotation to be within each half of the output until the rotation is absorbed after the final
stage (Fact 3.6).
Now the process above can be repeated on the left and right subnets, which are them-
selves butterfly networks: count the number of “1”s in the local right half of the mask and
then keep that many bits in the right half of the subnetwork and swap the remaining selected
data bits to the left half. Account for the rotation of the left half by modifying subsequent
Figure 3.9: Even (or R, dotted) and odd (or L, solid) subnetworks of the inverse butterfly network.
control bits.
This can be repeated for each subnetwork in each subsequent stage until the final stage is
reached, where the final parallel deposit result will have been achieved (e.g., Figure 3.4(c)).
3.2.2 Parallel Extract on the Inverse Butterfly Network
We will now show that pex can be mapped onto the inverse butterfly network. The inverse
butterfly network is decomposed into even and odd subnetworks, in contrast to the butterfly
network which is decomposed into right and left subnetworks. See Figure 3.9, where the
even subnetworks are shown with dotted lines and the odd with solid lines. However, for
simplicity of notation we refer to even as R and odd as L.
Fact 3.8. Any single data bit can be moved to any result position by just moving
it to the correct R or L subnetwork of the intermediate result at every stage of the
inverse butterfly network.
This can be proved by induction on the number of stages. At stage 1, the data bit is
moved to its final position mod 2 (i.e., to R or L). At stage 2, it is moved to its final position
mod 4, and so on. At stage lg(n), it is moved to its final position mod 2^lg(n) = n, which is
its final result position.
Fact 3.9. A permutation is routable on an inverse butterfly network if the destina-
tions of the bits constitute a complete set of residues mod m (i.e., the destinations
equal 0, 1, . . . ,m− 1 mod m) for each subnetwork of width m.
Based on Fact 3.8, bits are routed on the inverse butterfly network by moving them to the
correct position mod 2 after the first stage, mod 4 after the second stage, etc. Consequently,
if the two bits entering stage 1, (with 2-bit wide inverse butterfly networks as shown in
Figure 3.9), have destinations equal to 0 and 1 mod 2 (i.e., one is going to R and one to L),
Fact 3.8 can be satisfied for each bit and they are routable through stage 1 without conflict.
Subsequently, the four bits entering stage 2 (with the 4-bit wide butterfly networks) must
have destinations equal to 0, 1, 2 and 3 mod 4 to satisfy Fact 3.8 for each bit and be routable
through stage 2 without conflict. A similar constraint exists for each stage.
Theorem 3.10. Any Parallel Extract (pex) instruction on n bits can be imple-
mented with one pass through an inverse butterfly network of lg(n) stages without
path conflicts (with the un-selected bits on the left zeroed out).
Proof. The pex operation compresses bits in their original order into adjacent bits in the
result. Consequently, two adjacent selected data bits that enter the same stage 1 subnetwork
must be adjacent in the output – one bit has a destination equal to 0 mod 2 and the other has
a destination equal to 1 mod 2. Thus the destinations constitute a complete set of residues
mod 2 and thus are routable through stage 1. The selected data bits that enter the same
stage 2 subnetwork must be adjacent in the output and thus form a set of residues mod 4
and are routable through stage 2. A similar situation exists for the subsequent stages, up to
the final n-bit wide stage. No matter what the bit mask of the overall pex operation is, the
selected data bits will be adjacent in the final result. Thus the destination of the selected
data bits will form a set of residues mod n and the bits will be routable through all lg(n)
stages of the inverse butterfly network.
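The residue condition underlying this proof can be checked directly. The sketch below (our own code, with a hypothetical function name) groups the selected source positions by the contiguous width-2^s subnetworks of each stage and verifies that their pex destinations are distinct mod 2^s (a subset of a complete residue system; the zeroed, unselected bits are don't-cares):

```python
def pex_routable(mask: str) -> bool:
    """Check Theorem 3.10's routability condition: for every stage s, the
    destinations of selected bits sharing a width-2^s subnetwork of the
    inverse butterfly network are distinct mod 2^s."""
    n = len(mask)
    # bit positions counted from the right; the j-th selected bit
    # (counting from the right) is compressed to position j
    sources = [i for i in range(n) if mask[n - 1 - i] == "1"]   # ascending
    dest = {s: j for j, s in enumerate(sources)}
    m = 2
    while m <= n:
        groups = {}
        for s in sources:
            groups.setdefault(s // m, []).append(dest[s] % m)
        if any(len(set(g)) != len(g) for g in groups.values()):
            return False
        m *= 2
    return True
```

Running this over every 8-bit mask confirms that pex never produces a path conflict on the inverse butterfly network, as Theorem 3.10 asserts.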
3.2.3 Need for Two Datapaths
It would be convenient if both pex and pdep could be implemented using the same datapath circuit. Unfortunately, this is not possible. There is one unique path between any input position and output position in the butterfly or inverse butterfly network (see Fact 3.1 and Fact 3.8 – once a bit has been moved to the correct right or left subnetwork, subsequent stages cannot route bits back to the other subnetwork). Consequently, we need only show one counterexample where paths conflict to prove that pex on butterfly and pdep on inverse butterfly are not possible in the general case.
We first consider trying to implement pex using a butterfly circuit. From Fact 3.1
we see that parallel extract cannot be mapped to the butterfly network. Parallel extract
compresses bits and thus it is easy to encounter scenarios where two bits entering the same
switch would both require the same output in order to be moved to the correct half (L or R)
corresponding to their final destinations. Consider the parallel extract operation shown in
Figure 3.10. In order to move both bits d and h to the correct half of their final positions,
both must be output in the right half after stage 1. This clearly is a conflict and thus parallel
extract cannot be mapped to butterfly.
We now consider implementing pdep using an inverse butterfly circuit. From Fact 3.8
we see that parallel deposit cannot be mapped to the inverse butterfly network. Parallel
deposit scatters right-aligned bits to the left, and thus it is easy to encounter scenarios
where two bits entering the same switch would both require moving to the same left, or
right, subnetwork. Consider the parallel deposit operation shown in Figure 3.11. In order
to move both bits g and h to their final positions mod 2, both must be output in the right
subnet (i.e., even network: 0 mod 2) after stage 1. This clearly is a conflict and thus parallel
Figure 3.10: Conflicting paths when trying to map pex to butterfly.
deposit cannot be mapped to the inverse butterfly datapath.
3.2.4 Decoding the Bitmask into Butterfly Control Bits
The steps in the proof of Theorem 3.7 give an outline for how to decode the n-bit bitmask
into controls for each stage of a butterfly datapath. For each right half of a stage of a
subnetwork we count the number of “1”s in that local right half of the mask, say k “1”s,
and then set the k rightmost control bits to “0”s and the remaining bits to “1”s. This serves
to keep block X in the local R half and export Y to the local L half (refer to Figures 3.3
and 3.5 for nomenclature). We then assume that we explicitly rotate Y and Z to be the
rightmost bits in order in the local L half and iterate through the stages to come up with an
initial set of control bits. After this, we eliminate the need for explicit rotations of Y and Z
by modifying the control bits. This is accomplished by a left rotate and complement upon
wrap around (LROTC) operation, rotating the control bits by the same amount obtained
when assuming explicit rotations.
We will now simplify this process considerably. First, note that when we modify control
Figure 3.11: Conflicting paths when trying to map pdep to inverse butterfly.
bits to compensate for a rotation in a given stage, we do so by propagating the rotation
through all the subsequent stages. This means that when the control bits of a local L are
modified, they are rotated and complemented upon wrap around by the number of “1”s in
the local R, and by the number of “1”s in the local R of the preceding stage, and by the
number of “1”s in all the local R’s of all preceding stages up to the R in the first stage. In
other words, the control bits of the local L are rotated by the total number of “1”s to its
right in the bitmask.
Consider the example of Figure 3.4(b). The control bit in stage 3 in the LL subnetwork
is initially a “1” when we assumed explicit rotations. We first rotated and complemented
this bit by 3, the number of “1”s in R of the bitmask: 1 → 0 → 1 → 0 (Figure 3.6). We
then rotated and complemented this bit by another 1 position, the number of “1”s in LR of
the bitmask: 0 → 1. This yielded the final control bit in Figure 3.4(c). Overall we rotated
this bit by 4, the total number of “1”s to the right of LL or to the right of position 6. This is
a population count (POPCNT) of the bitstring from the rightmost bit to bit position 6.
Second, we need to produce a string of k “0”s from a count (in binary) of k, to derive
Figure 3.12: LROTC(“1111”, 3) = “1000”.
the initial control bits assuming explicit rotations. This can also be done with a LROTC
operation. We start with a string of all “1”s of the correct length and then, for every position
in the rotation, we wrap around a “1” from the left and complement it to get a “0” on the
right. The end result, after a LROTC by k, is a string of the correct length with k rightmost
bits set to “0” and the rest set to “1” (Figure 3.12, where k = 3).
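A minimal sketch of LROTC in Python (our own model, not the hardware) makes this behavior concrete:

```python
def lrotc(a: str, rot: int) -> str:
    """Left rotate and complement on wrap: on each step the bit that wraps
    from the left end to the right end is complemented (Figure 3.12)."""
    for _ in range(rot):
        a = a[1:] + ("0" if a[0] == "1" else "1")
    return a
```

For example, lrotc("1111", 3) gives "1000" as in Figure 3.12, and the period of a k-bit LROTC is 2k, since a full trip around the string complements every bit.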
We can now combine these two facts: the initial control bits are obtained by a LROTC of a string of “1”s the length of the local R by the POPCNT of the bits in the mask in the local R and all bits to the right of it. We denote a string of k “1”s as 1^k. We specify a bitfield from bit h to bit v as {h:v}, where v is to the right of h. So,
• for stage 1, we calculate the control bits as
LROTC(1^(n/2), POPCNT(mask{n/2 − 1:0})),
• for stage 2, we calculate the control bits as
LROTC(1^(n/4), POPCNT(mask{3n/4 − 1:0})) for L and
LROTC(1^(n/4), POPCNT(mask{n/4 − 1:0})) for R,
• for stage 3, we calculate the control bits as
LROTC(1^(n/8), POPCNT(mask{7n/8 − 1:0})) for LL,
LROTC(1^(n/8), POPCNT(mask{5n/8 − 1:0})) for LR,
LROTC(1^(n/8), POPCNT(mask{3n/8 − 1:0})) for RL, and
LROTC(1^(n/8), POPCNT(mask{n/8 − 1:0})) for RR,
and so on for the later stages.
Let us verify that this is correct using the example of Figure 3.4(c):
• for stage 2, the control bits are
LROTC(1^2, POPCNT(“101101”)) = LROTC(“11”, 4) = “11” for L and
LROTC(1^2, POPCNT(“01”)) = LROTC(“11”, 1) = “10” for R,
• for stage 3, the control bit is
LROTC(1^1, POPCNT(“0101101”)) = LROTC(“1”, 4) = “1” for LL,
LROTC(1^1, POPCNT(“01101”)) = LROTC(“1”, 3) = “0” for LR,
LROTC(1^1, POPCNT(“101”)) = LROTC(“1”, 2) = “1” for RL, and
LROTC(1^1, POPCNT(“1”)) = LROTC(“1”, 1) = “0” for RR.
This agrees with the result shown in Figure 3.4(c).
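The same check can be scripted; this short sketch (our own, treating the mask as a string with mask{h:0} its h+1 rightmost characters) evaluates the stage formulas for mask = 10101101:

```python
def lrotc(a, rot):
    # left rotate and complement on wrap
    for _ in range(rot):
        a = a[1:] + ("0" if a[0] == "1" else "1")
    return a

mask = "10101101"                      # the example of Figure 3.4
pc = lambda bits: bits.count("1")      # POPCNT on a bitfield string

stage2_L = lrotc("11", pc(mask[-6:]))  # LROTC(1^2, POPCNT(mask{5:0}))
stage2_R = lrotc("11", pc(mask[-2:]))  # LROTC(1^2, POPCNT(mask{1:0}))
# LL, LR, RL, RR: LROTC(1^1, POPCNT(mask{b:0})) for b = 6, 4, 2, 0
stage3 = [lrotc("1", pc(mask[-(b + 1):])) for b in (6, 4, 2, 0)]
```

Evaluating this reproduces the values listed above: "11" and "10" for stage 2, and 1, 0, 1, 0 for stage 3.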
Overall, the population counts of the bits from position 0 to position k, for k = 0 to n − 2, are required. We call this the set of prefix population counts. One interesting observation is that for stage 1 we need one count of n/2^1 bits, for stage 2 we need the two counts of the odd multiples of n/2^2 bits, for stage 3 we need the four counts of the odd multiples of n/2^3 bits and so on.
Using the two new functions we have defined, LROTC and POPCNT, we present an
algorithm (Algorithm 3.11, Figure 3.13) to decode the n mask bits into the n × lg(n)/2
control bits for pdep (and for pex). Algorithm 3.11 summarizes the discussion above for
obtaining the control bits for achieving pdep on a butterfly circuit.
We note that Algorithm 3.11 can also be used to obtain the control bits for the inverse
butterfly for a pex operation, with the one caveat that the controls for stage i of the butterfly
datapath are routed to stage lg(n)−i+1 in the inverse butterfly datapath. This can be shown
using an approach similar to that for pdep, except for working backwards from the final
Algorithm 3.11. To generate the n × lg(n)/2 butterfly or inverse butterfly control bits from the n-bit mask.

Input: mask; the bitmask
Output: bcb; the lg(n) × n/2 matrix containing the butterfly control bits
        ibcb; the lg(n) × n/2 matrix containing the inverse butterfly control bits

Let x||y indicate the concatenation of bit patterns x and y. POPCNT(a) is the population count of “1”s in bitfield a. mask{h:v} is the bitfield from bit h to bit v of the mask. 1^k indicates a bit-string of k ones. LROTC(a, rot) is a “left rotate and complement on wrap” operation, where a is the input and rot is the rotation amount.

1. Calculate the prefix population counts:
   For i = 0, 1, . . ., n − 2:
       pc[i] = POPCNT(mask{i:0})

2. Calculate the butterfly (and inverse butterfly) control bits for each stage by performing LROTC(1^k, pc[m]), where k is the size of the local R and m ranges over the set of the leftmost bit positions of the local R’s:
   bcb = ibcb = {}
   For i = 1, . . ., lg(n):               // for each stage
       k = n/2^i                          // number of bits in local R
       For j = 1, 3, 5, . . ., 2^i − 1:   // for each local R
           m = j × k − 1                  // the leftmost bit position of the local R
           temp = LROTC(1^k, pc[m])
           bcb[i] = temp || bcb[i]
           ibcb[lg(n) − i + 1] = temp || ibcb[lg(n) − i + 1]

Figure 3.13: Algorithm for decoding a bitmask into controls for a butterfly (or inverse butterfly) datapath.
stage. The derivation is given in more detail in an appendix to this chapter (Section 3.5).
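As a cross-check, the following Python model (our own, not from the hardware design) implements Algorithm 3.11 and behavioral butterfly and inverse butterfly datapaths, then verifies both pdep and pex for every 8-bit mask. We assume stage s of the butterfly swaps pairs at distance n/2^s and stage s of the inverse butterfly at distance 2^(s−1), with unselected bits zeroed externally as the theorems require:

```python
def lrotc(a, rot):
    # left rotate and complement on wrap
    for _ in range(rot):
        a = a[1:] + ("0" if a[0] == "1" else "1")
    return a

def decode(mask):
    """Algorithm 3.11: mask -> per-stage control strings bcb, ibcb."""
    n = len(mask)
    lgn = n.bit_length() - 1
    pc = [mask[n - 1 - i:].count("1") for i in range(n)]  # prefix popcounts
    bcb = [""] * (lgn + 1)
    ibcb = [""] * (lgn + 1)
    for i in range(1, lgn + 1):
        k = n >> i                              # size of each local R
        for j in range(1, 2 ** i, 2):           # each local R, right to left
            temp = lrotc("1" * k, pc[j * k - 1])
            bcb[i] = temp + bcb[i]
            ibcb[lgn - i + 1] = temp + ibcb[lgn - i + 1]
    return bcb, ibcb

def run(data, ctrls, dists):
    """Apply stages of paired swaps; dists[s] is the pair distance of stage s."""
    out = list(data)
    for ctrl, d in zip(ctrls, dists):
        c = iter(ctrl)
        for base in range(0, len(out), 2 * d):  # each contiguous block
            for p in range(base, base + d):
                if next(c) == "1":
                    out[p], out[p + d] = out[p + d], out[p]
    return "".join(out)

def pdep_ref(data, mask):
    k = mask.count("1")
    sel = iter(data[len(data) - k:])
    return "".join(next(sel) if m == "1" else "0" for m in mask)

def pex_ref(data, mask):
    picked = "".join(d for d, m in zip(data, mask) if m == "1")
    return picked.rjust(len(data), "0")

n, lgn, data = 8, 3, "abcdefgh"
for v in range(2 ** n):
    mask = format(v, "08b")
    bcb, ibcb = decode(mask)
    k = mask.count("1")
    bfly_in = "0" * (n - k) + data[n - k:]      # selected bits right-aligned
    assert run(bfly_in, bcb[1:], [n >> s for s in range(1, lgn + 1)]) \
        == pdep_ref(data, mask)
    ibfly_in = "".join(d if m == "1" else "0" for d, m in zip(data, mask))
    assert run(ibfly_in, ibcb[1:], [1 << (s - 1) for s in range(1, lgn + 1)]) \
        == pex_ref(data, mask)
```

For the running example the decoder produces stage controls "1000", "1110" and "1010", matching the values derived for Figure 3.4(c).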
3.2.5 Hardware Decoder
The execution time of Algorithm 3.11 in software is approximately 675 cycles on an Intel
Pentium-D processor. This software routine is useful only for pex or pdep operations for
which the bit mask is known ahead of time. However, for dynamic masks, we require a
hardware decoder that implements Algorithm 3.11 in order to achieve a high performance.
Fortunately, Algorithm 3.11 just contains two basic operations, population count and left-
rotate-and-complement, both of which have straightforward hardware implementations. A
block diagram of the decoder is shown in Figure 3.14.
Figure 3.14: Hardware decoder for dynamic pdep and pex operation (for pex, the order of the outputs is reversed).
Before describing the decoder in detail, we make note of a simplification based on the properties of rotations – that they are invariant when the rotation amount differs by the period of rotation, which for a k-bit LROTC is 2k. Thus, for the ith stage of the butterfly network, k = n/2^i (see Algorithm 3.11) and the POPCNTs are only computed mod n/2^(i−1).
For example, for the 64-bit hardware decoder, the 32 POPCNTs of butterfly stage 6, corresponding to the odd multiples of n/64, need only be computed mod 2 – only the least significant bit; the 16 POPCNTs of butterfly stage 5 need only be computed mod 4 – the two least significant bits; and so on. Only the POPCNT of 32 bits
for stage 1 requires the full lg(n)-bit POPCNT. In total, 120 bits are computed instead of
the 378 bits that would have been computed had the full lg(n)-bit POPCNTs been required
for each position.
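The savings can be tallied directly; this small sketch (our own arithmetic check) computes the POPCNT widths per stage for n = 64:

```python
# For n = 64 (lg(n) = 6): stage i has 2^(i-1) local R's, and its POPCNTs
# need only lg(n) - i + 1 bits, since they are taken mod n/2^(i-1).
n, lgn = 64, 6
reduced = sum(2 ** (i - 1) * (lgn - i + 1) for i in range(1, lgn + 1))
full = (n - 1) * lgn     # full 6-bit counts at every prefix position
print(reduced, full)     # prints: 120 378
```

This reproduces the totals stated above: 120 computed bits instead of 378.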
The first stage of the decoder is a parallel prefix population counter. This is a circuit
that computes in parallel all the population counts of step 1 of Algorithm 3.11. In designing
this circuit we considered population counter architectures and parallel prefix networks for
Figure 3.15: (a) Count of 8 bits; (b) Count of 24 bits; (c) Count of 32 bits; (d) Count of 56 bits
adders.
The carry shower counter [75] groups the input into sets of three lines which are input
into full adders. The sum and carry outputs of the full adders are each grouped into sets of
three lines which are input to another stage of full adders and so on.
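The carry shower idea can be sketched behaviorally (our own Python model with hypothetical names; the real circuit is fixed-structure combinational logic, not this iterative loop): lines of equal weight are fed in groups of up to three into full adders until one line per weight remains.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry) for three input lines."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def carry_shower_popcount(bits):
    """Population count via carry shower counting: repeatedly reduce the
    lines at each weight through full adders, carries moving up one weight."""
    cols = {0: list(bits)}                # weight -> lines of that weight
    while any(len(v) > 1 for v in cols.values()):
        nxt = {}
        for w, lines in sorted(cols.items()):
            while len(lines) >= 2:
                a, b = lines.pop(), lines.pop()
                c = lines.pop() if lines else 0   # pad a pair with a zero line
                s, cy = full_adder(a, b, c)
                nxt.setdefault(w, []).append(s)
                nxt.setdefault(w + 1, []).append(cy)
            if lines:                             # single leftover line
                nxt.setdefault(w, []).append(lines.pop())
        cols = nxt
    return sum(v[0] << w for w, v in cols.items())
```

The final single line per weight is the binary count; in the decoder the last addition is instead completed by a carry-propagate adder on the carry-save form.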
In our circuit, in the first stage we group 8 bits into 3 sets due to the structure of the
butterfly and inverse butterfly networks, i.e., the subnetworks have power of 2 sizes. In
stage 2, we add three sums and three carries to produce the sum of the 8 bits in carry-save
form. Figure 3.15(a) shows the first two stages of the counter. The output of this stage is a
2-bit sum and 2-bit carry.
We then gather 3 sets of sums and carries in stage 3 and produce the sum of 24 bits
in carry-save form (this requires a second full adder delay within the stage as shown in
Figure 3.15(b)). In stage 4, we gather up to three intermediate results and reduce these
to carry-save form. Stage 4 for the 32-bit count, which requires the full lg(n) bits, is
shown in Figure 3.15(c) and for the 56-bit count, which is computed mod 16, is shown
in Figure 3.15(d). We then use a carry-propagate adder to complete the addition. The
complete networks for the 32-bit, 56-bit and 63-bit counts are shown in Figure 3.16.
The parallel prefix architecture resembles radix-3 Han-Carlson [76], which is a parallel
prefix look-ahead carry adder that has lg(n) + 1 stages with carries propagated to the odd
positions in the extra final stage to limit fanout at the earlier levels. In our circuit, we
defer the 1- and 2-bit counts to the end, similar to odd carries being deferred in the Han-
Carlson adder, as these are easy to compute from the other counts. This can be seen in
Figure 3.16(c). The radix-3 nature stems from the carry shower counter design, as we
group 3 lines to input into a full adder at each level. Thus, the counter has ⌈log_3(64)⌉ + 2 = 6
stages (where each stage can have multiple full adder delays). We note that in order to
efficiently compute the 3-bit and greater counts, the first two stages are overlapped every 4
positions. This can be seen in Figure 3.17(a) which shows the diagram of the entire parallel
prefix network. The numbers at the top are bit positions. The numbers at the bottom
indicate the number of bits in the sum. The trapezoids indicate adder blocks such as those
in Figure 3.15. The squares indicate the final carry-propagate adder, which can be as small
as 1-bit.
The parallel prefix population count circuit computes the population count of the low 32
bits, the low 16 bits and the low 8 bits (and the low 4 bits and the low 2 bits). With a minor
extension, the population count of the full 64 bit word can be computed (Figure 3.17(b)).
Similarly, the population count of each byte is computed in carry-save form and with min-
imal logic can be transformed into the full 4-bit count of each byte. By modifying the
network, we can also compute the population count of each 16-bit or 32-bit subword. Thus
the parallel prefix population count circuit can be used as the basis of a population count
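The prefix counts this circuit produces can be modeled functionally. The following Python sketch (ours, a behavioral model rather than the carry-save network) returns the power-of-2 prefix population counts of a 64-bit word in one pass:

```python
def prefix_popcounts(x):
    """Return {width: popcount of the low `width` bits of x} for the
    power-of-2 prefixes used by the pex/pdep decoder."""
    counts = {}
    total = 0
    for i in range(64):
        total += (x >> i) & 1
        if (i + 1) in (2, 4, 8, 16, 32, 64):
            counts[i + 1] = total
    return counts

c = prefix_popcounts(0xFFFF_FFFF_0000_00FF)
assert c[8] == 8 and c[16] == 8 and c[32] == 8 and c[64] == 40
```

The hardware shares intermediate carry-save sums between these prefixes; the sketch only captures the input/output behavior.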
explains why in Table 3.2, the variable circuits (Figures 3.22 and 3.23) have approximately
18-30% longer cycle time latencies compared to the static case (Figure 3.20). They are also
2.5 to 3.7 times larger than the static unit.
3.4 Summary
This chapter introduces the parallel extract instruction to perform bit gather operations
and the parallel deposit instruction to perform bit scatter operations. Datapaths for these
instructions were also discussed. We showed that parallel extract maps to the inverse
butterfly network and parallel deposit maps to the butterfly network.
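For reference, the semantics of the two instructions can be written as a short Python model (our own code, operating on Python integers; the thesis instructions operate on 64-bit registers):

```python
def pex(data, mask, n=64):
    """Bit gather: select the data bits where mask is 1, compress them
    to the right preserving order; high result bits are zeroed."""
    result, out = 0, 0
    for i in range(n):                 # i = 0 is the rightmost bit
        if (mask >> i) & 1:
            result |= ((data >> i) & 1) << out
            out += 1
    return result

def pdep(data, mask, n=64):
    """Bit scatter: deposit the right-justified data bits into the
    positions where mask is 1, preserving order."""
    result, src = 0, 0
    for i in range(n):
        if (mask >> i) & 1:
            result |= ((data >> src) & 1) << i
            src += 1
    return result

assert pex(0b1011_0010, 0b1100_1010) == 0b1001
assert pdep(0b0101, 0b1100_1010) == 0b0100_0010
# pdep inverts pex on the selected bits:
assert pex(pdep(0b0101, 0b1100_1010), 0b1100_1010) == 0b0101
```

The same semantics later appeared in the x86 BMI2 instructions PEXT and PDEP.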
We then presented an algorithm for decoding the pex or pdep bitmask into the con-
trol bits for the datapaths. A hardware implementation of this algorithm was also pre-
sented. This hardware decoder has two basic components – parallel prefix population count
(POPCNT) and left-rotate-and-complement (LROTC) subcircuits.
With this foundation, we presented the design of a standalone advanced bit manip-
ulation functional unit that supports bfly, ibfly, pex and pdep. The high cost of
the hardware decoder that translates the pex or pdep bitmask into the datapath controls
prompted the definition of two variants of the pex and pdep instructions – static versions,
in which the datapath control bits are precomputed using Algorithm 3.11 in software (by
the compiler or programmer), and loop-invariant versions, in which hardware decoding is
performed once and then the static versions of the instructions are used. This led to the
design of two alternative functional units – one that supports only static pex and pdep
operations and one that supports variable pex.v and pdep.v instructions as well as loop
invariant operations. The unit that supports only the static instructions is faster (0.95× the
latency) and smaller (0.9× the area) than an ALU.
The next chapter of the thesis presents a new design for the standard shifter using the
advanced bit manipulation functional unit as its basis.
3.5 Appendix: Decoding the pex Bitmask into Inverse Butterfly Control Bits
We will now show that the inverse butterfly control bits for a pex operation can also be
obtained using Algorithm 3.11. This can be shown using an approach similar to that used
for pdep on butterfly in Section 3.2.
In the final stage of the inverse butterfly network, we call X the selected data bits that
pass through in R, Y the selected data bits that are transferred from L to R and Z the
selected data bits that pass through in L. Consider the possible combinations of X , Y and
Z:
i. X alone exists – when there are no selected data bits in L,
ii. Y alone exists – when there are no selected data bits in R,
Figure 3.24: Final stage of pex operation.
iii. X and Y exist – when no selected data bits that start in L stay in L,
iv. X and Z exist – when all the data bits in R are selected and there are some selected
data bits in L, or
v. X , Y and Z exist.
When X exists (i, iii, iv and v), it must be the rightmost bits of R, as the output of pex
is the selected data bits compressed maintaining their relative ordering and right justified
in the result.
When Y exists (ii, iii and v), it must be rotated left from the midpoint by the size of X
at the input to the final stage so that Y and X will be contiguous at the output of the final
stage after Y is swapped from L to R.
When Z exists (iv and v), it must be the rightmost bits in L so that when it is passed
through in L in the final stage it is contiguous with the bits in R.
When Y and Z exist (v), Y must additionally be the leftmost bits of L so that when
it is swapped into R in the final stage it is contiguous to Z on its left. Consequently, Y
and Z must be contiguous mod n/2 in L. Additionally, we can view Y and Z as being the
rightmost bits of L at the output of stage lg(n) − 1, which are then left rotated by |X| prior
to the input to stage lg(n). See Figure 3.24.
Thus, prior to the final stage we have two half-sized inverse butterfly networks, L and R,
with a pex operation performed within each half network (first row of Figure 3.24 where
selected data bits ZY are the right justified and compressed selected data bits in L and X is
the right justified and compressed selected bits in R). We then iterate backwards through the
stages, performing a local pex operation within each half subnetwork. Between the stages
we explicitly left rotate each local L by the number of selected data bits in the local R.
We use Fact 3.8 to route the bits in the local pex operation. This process is illustrated for
an example pex operation in Figure 3.25(a), broken down into steps in Figure 3.25(b).
Figure 3.25: (a) 8-bit pex operation (b) mapped onto inverse butterfly network with explicit left rotations between stages and (c) without explicit rotations of data bits by modifying the control bits.
In the first stage in Figure 3.25(b), we perform a local pex within each 2-bit subnet-
work:
• a belongs in position 6 for the local 2-bit pex, so it is swapped from L to R to yield
0a;
• c belongs in position 4, so it is swapped to yield 0c;
• e and f are in the correct positions (3 and 2, respectively), so they are passed through
to yield ef;
• h is in position 0, so it is passed through to yield 0h.
Then, prior to stage 2, we explicitly left rotate each local L of stage 2 by the number of “1”
bits in the corresponding local R of stage 2: 0a is left rotated by 1 to a0, as there is one
selected data bit (c) in the corresponding local R, and ef is also left rotated by 1 to fe, as
there also is one selected data bit (h) in the corresponding local R.
In stage 2, we perform a local pex within each 4-bit subnetwork:
• a belongs in position 5 and c in position 4 for the local 4-bit pex, so a is swapped
to yield 00ac.
• e belongs in position 2, f in position 1 and h in position 0, so f is swapped to yield
0efh.
Prior to stage 3, we explicitly left rotate L of stage 3, 00ac, by 3, the number of selected
bits in R of stage 3 (efh), to c00a.
In the final stage, c belongs in position 3 so it is swapped from L to R to yield the
overall output 000acefh.
Rather than explicitly left rotating L after each stage, we can incorporate the rotations
directly into the routing of the pex operation by modifying the control bits. This is based
on a property of inverse butterfly networks: given a permutation performed by the in-
verse butterfly network, a rotation of that permutation can also be achieved by rotating and
complementing the control bits by the same amount. (This can be shown using an argument
similar to the one used for Fact 3.6, the analogous fact for butterfly networks.)
Stage i of the inverse butterfly network can add 2i positions to the rotation, i.e., at
the output of the first stage the bits can be rotated by 0 or 1 positions from the initial
permutation. In the second stage, the bits can be rotated an additional 0 or 2 positions, or
the bits can rotated from the inputs by 0, 1, 2 or 3 positions and so on. Thus, we can rotate
the output of any stage of a subnetwork starting from any input and any initial permutation.
Consider the example of Figure 3.25(b), where we wish to eliminate the explicit data
rotations in the local L halves by rotating the control bits instead. To add a left rotation by
1 of 0a and ef through stage 1, we (left rotate and) complement the control bit for those
two subnetworks by 1 and we get the desired result (a0 and fe) as shown in Figure 3.26.
Then, to add a left rotation by 3 in L through stage 2, we left rotate and complement the
control bits by 3 in the first 2 stages of L. The first stage control bits are complemented 3
times in succession: 0 → 1 → 0 → 1 and 1 → 0 → 1 → 0 and the second stage control
bits are left rotated and complemented by 3 positions: 10→ 00→ 01→ 11. The result is
shown in Figure 3.25(c), where we see no explicit rotations, yet nevertheless the input of L
in stage 3 is still c00a and the overall output is still 000acefh.
Hence, the derivation of the control bits for the inverse butterfly network for a pex
operation is in fact identical to the derivation of the control bits for the pdep on the butterfly
network. At each stage, starting from stage 1, in each subnetwork, we pass through the k
rightmost bits by setting the k rightmost control bits to “0”, if there are k “1”s in the local
R of the mask. This is generated by LROTC(“1 . . . 1”, k). We then left rotate the output
by the number of “1”s in the next local R by left rotating and complementing the control
bits from stage 1 through the current stage by that amount. When we add up the effect of
rotating the outputs of each stage – that each rotation requires rotating the control bits from
stage 1 through that stage – the net result is a LROTC of the control bits by the total number
of “1”s to the right of the subnetwork.

Figure 3.26: 8-bit pex operation with explicit left rotation after stage 1 eliminated.
So we arrive at the same result: the initial control bits are obtained by a LROTC of a
string of all “1”s the length of the local R by the number of “1”s in the local R of the mask
and are then corrected by a further LROTC by the total number of “1”s to the right in the
mask. These can be combined into a single LROTC of the all one string by the total number
of “1”s in the local R and to the right. Thus Algorithm 3.11 also yields the configuration
bits for mapping pex to inverse butterfly, with the one caveat that the controls for stage i
of the butterfly datapath are equivalent to those for stage lg(n) − i + 1 in the inverse butterfly datapath
(as indicated in the final line of Algorithm 3.11).
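As a sanity check on this derivation, the combined rule – the control bits of each subnetwork are a LROTC of an all-ones string by the popcount of the mask from the subnetwork's midpoint rightward – can be tested in software. The following Python sketch (our own modeling; bit strings are written left to right as in the figures) decodes a mask, routes data through a 3-stage inverse butterfly, and compares the right-justified result bits against the selected bits:

```python
def lrotc(bits, s):
    # left rotate with complement-on-wrap
    for _ in range(s):
        bits = bits[1:] + ('0' if bits[0] == '1' else '1')
    return bits

def ibfly_pex(data, mask):
    """Route `data` (string, leftmost char = MSB) through an inverse
    butterfly configured for pex of `mask` (string of '0'/'1')."""
    n = len(data)
    state = list(data)
    half = 1                       # pair distance at the current stage
    while half < n:
        size = 2 * half            # subnetwork width at this stage
        for base in range(0, n, size):
            # mask popcount from the subnetwork midpoint to word end:
            # local R ones plus all ones to the right of the subnetwork
            s = mask[base + half:].count('1')
            ctrl = lrotc('1' * half, s)
            for p in range(half):  # ctrl[p] controls pair p, left to right
                if ctrl[p] == '1':
                    a, b = base + p, base + half + p
                    state[a], state[b] = state[b], state[a]
        half = size
    return ''.join(state)

data, mask = 'abcdefgh', '10101101'     # the example of Figure 3.25
out = ibfly_pex(data, mask)
ref = ''.join(d for d, m in zip(data, mask) if m == '1')
assert out[-mask.count('1'):] == ref    # low bits hold 'acefh'
```

Non-selected bits land in the high positions (they are zeroed by separate logic in the real unit); only the right-justified selected bits are checked.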
Chapter 4
Advanced Bit Manipulation Functional
Unit as a Basis for Shifters
In the previous chapter, we described a standalone advanced bit manipulation functional
unit that supports bit permutation, bit gather and bit scatter operations. In this chapter,
we propose using the advanced bit manipulation functional unit as a new basis for shifters,
rather than simply adding it to a processor core, or enhancing the current shifter to also
support the above advanced bit manipulation functions. We propose replacing two existing
functional units (the shifter and the multimedia-mix functional units in Itanium processors)
with the new permutation functional unit and performing all the operations previously done
on these existing shifters as well as the advanced bit permutation, bit gather and bit scatter
operations on the same datapath.
This chapter is organized as follows. Section 4.1 lists the basic shifter instructions.
Section 4.2 then shows that the advanced bit manipulation functional unit can perform
any of these basic operations. The hard part is determining how the controls of the lg(n)
stages of the circuit should be set, and we give definitive algorithms for this. Section 4.3
gives a detailed description of the complete new shift-permute functional unit. Section 4.4
discusses how this new functional unit can be considered an evolution of the design of
shifters and compares the new design to the classic log shifter. Section 4.5 summarizes this
chapter.
4.1 Basic Shifter Operations
The basic shifter instructions consist of two groups and are summarized in Table 4.1. The
first group consists of the shift and rotate instructions supported in essentially all micropro-
cessors. These instructions include right or left shifts (with zero or sign propagation), and
right or left rotates. While a few microprocessors support only shifts but not rotates, we will
consider rotate as a basic supported operation. The second group of instructions include
extract and deposit instructions and mix instructions, all introduced in Chapter 2. No ISA
currently supports mix for subwords smaller than a byte. In our proposed new functional
unit, mix for bits – and for all subword sizes that are powers of 2 – is supported. This
includes 12 mix operations: mix.left and mix.right for each of the 6 subword sizes
2^0, 2^1, 2^2, 2^3, 2^4 and 2^5 bits, for a 64-bit datapath processor.
4.2 Basic Shifter Operations on the Inverse Butterfly Datapath
The key conceptual insight comes from recognizing that the set of basic shifter and mix
operations in Table 4.1 is based on minor variations on a rotation operation, and that any
rotation operation can be achieved on an inverse butterfly circuit (or on a butterfly circuit).
Theorem 4.1. An inverse butterfly circuit can achieve any rotation of its input.
This is a well-known property of inverse butterfly circuits. A proof of this can be found
in [78], where rotations are called cyclic shifts.
Table 4.1: Standard shifter operations

rotr r1 = r2, shamt
rotr r1 = r2, r3
    Right rotate of r2. The rotate amount is either an immediate in the opcode (shamt), or specified using a second source register r3.

rotl r1 = r2, shamt
rotl r1 = r2, r3
    Left rotate of r2.

shr r1 = r2, shamt
shr r1 = r2, r3
    Right shift of r2 with vacated positions on the left filled with the sign bit (most significant bit of r2) or zero-filled. The shift amount is either an immediate in the opcode or specified using a second source register r3.

shl r1 = r2, r3
    Left shift of r2 with vacated positions on the right zero-filled.

extr r1 = r2, pos, len
extr.u r1 = r2, pos, len
    Extraction and right justification of a single field from r2 of length len from position pos. The high order bits are filled with the sign bit of the extracted field (extr) or zero-filled (extr.u).

dep.z r1 = r2, pos, len
dep r1 = r2, r3, pos, len
    Deposit at position pos of a single right justified field from r2 of length len. Remaining bits are zero-filled (dep.z) or merged from second source register r3 (dep).

mix{r,l}{0,1,2,3,4,5} r1 = r2, r3
    Select right or left subword from a pair of subwords, alternating between source registers r2 and r3. Subword sizes are 2^i bits, for i = 0, 1, 2, . . ., 5, for a 64-bit processor.
Corollary 4.2. An enhanced inverse butterfly circuit can perform on its input:
a. Right and left shifts
b. Extract operations
c. Deposit operations
d. Mix operations
Proof. This follows from Theorem 4.1, with these operations modeled as a rotate with
additional logic handling zeroing, or sign extension from an arbitrary position, or merging
bits from the second source operand. Mix is modeled as a rotate of one operand by the
subword size and then a merge of subwords alternating between the two operands.
As the inverse butterfly circuit only performs permutations without zeroing and without
replication, the circuit must be enhanced with an extra 2:1 multiplexer stage at the end that
either selects the rotated bits “as is” or other bits which are precomputed as either zero, or
the sign bit (replicated), or the bits of the second source operand, depending on the desired
operation.
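The decomposition of Corollary 4.2 – a rotate followed by a final 2:1 select stage – can be sketched functionally. In this Python model (our decomposition mirroring the corollary, not the gate-level unit; an 8-bit word is used for brevity), every operation is a rotation plus a per-bit choice between the rotated word and precomputed fill bits:

```python
N = 8                      # small word size for illustration; the unit is 64-bit
MASK = (1 << N) - 1

def rotr(x, s):
    s %= N
    return ((x >> s) | (x << (N - s))) & MASK

def select(rotated, fill, keep):
    """Final 2:1 multiplexer stage: rotated bits where keep is 1,
    precomputed fill bits (zero, sign, or second operand) elsewhere."""
    return (rotated & keep) | (fill & ~keep & MASK)

def shr(x, s):                    # logical right shift = rotate + zero high bits
    return select(rotr(x, s), 0, (1 << (N - s)) - 1)

def extr_u(x, pos, length):       # unsigned extract (assumes pos + length <= N)
    return select(rotr(x, pos), 0, (1 << length) - 1)

def dep_z(x, pos, length):        # deposit-and-zero = left rotate + field mask
    return select(rotr(x, N - pos), 0, ((1 << length) - 1) << pos)

def mix_r(x, y, k):
    """mix.right for subword size k: in each 2k block, the right
    subwords of x and y, with x's placed on the left."""
    left = 0
    for base in range(0, N, 2 * k):
        left |= ((1 << k) - 1) << (base + k)
    return select(rotr(x, N - k), y, left)   # rotl(x, k) = rotr(x, N - k)

assert shr(0b1011_0011, 3) == 0b0001_0110
assert extr_u(0b1011_0011, 4, 3) == 0b011
assert dep_z(0b101, 2, 3) == 0b1_0100
assert mix_r(0b1101_0110, 0b0011_1010, 2) == 0b0111_1010
```

Sign-extending variants would use a replicated sign bit as the fill value, and dep would use the second operand, exactly as the multiplexer stage described above.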
Corollary 4.3. Theorem 4.1 and Corollary 4.2 are true for the butterfly network as
well.
Proof. The butterfly and inverse butterfly networks exhibit a reverse symmetry of their
stages from input to output. Thus a rotation on the inverse butterfly network is equivalent
to a rotation in the opposite direction on the butterfly network when the flow through the
network is reversed (see Figure 4.1). Hence, a butterfly circuit can also achieve any rotation
of its inputs. As in Corollary 4.2, a butterfly network enhanced with an extra multiplexer
stage at the end is needed to handle zeroing or sign extension or merging bits from the
second source operand.
Figure 4.1: Left rotate by three on inverse butterfly is equivalent to right rotate by three on butterfly.
We first show how control bits are obtained for rotations on an inverse butterfly circuit
in Section 4.2.1, then for the other operations in Section 4.2.2.
4.2.1 Determining the Control Bits for Rotations
To achieve a right (or left) rotation by s positions, for s = 0, 1, 2, . . ., n− 1, using the n-bit
wide inverse butterfly circuit with lg(n) stages, the input must be right (or left) rotated by
s mod 2j within each 2j-bit wide inverse butterfly circuit at each stage j. This is because
from stage j + 1 on, the inverse butterfly circuit can only move bits at granularities larger
than 2j positions (so the finer movements must have already been performed in the prior
stages). We first give a conceptual explanation of this, then a formal constructive proof to
obtain the actual control bits for a rotation.
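The invariant – after stage j, every 2^j-bit subnetwork holds its slice of the input right-rotated by s mod 2^j – directly determines the control bits: each stage's controls are whichever pass/swap choices carry one invariant state to the next. A small Python sketch (our constructive check, not the thesis's generator circuit) derives and verifies them for n = 8:

```python
def local_rotr(bits, s):
    s %= len(bits)
    return bits[-s:] + bits[:-s] if s else bits

def rotate_via_ibfly(n, s, data):
    """Check inverse butterfly controls for a right rotation by s using
    the invariant that after stage j every 2**j-bit block holds its
    input slice right-rotated by s mod 2**j."""
    def state(j):                    # invariant state after stage j
        size = 2 ** j
        return ''.join(local_rotr(data[b:b + size], s)
                       for b in range(0, n, size))
    prev = data
    for j in range(1, n.bit_length()):          # stages 1 .. lg(n)
        size, half = 2 ** j, 2 ** (j - 1)
        nxt = state(j)
        for base in range(0, n, size):
            for p in range(half):
                a, b = base + p, base + half + p
                if nxt[a] != prev[a]:           # control bit = 1: swap pair
                    assert nxt[a] == prev[b] and nxt[b] == prev[a]
                else:                           # control bit = 0: pass through
                    assert nxt[b] == prev[b]
        prev = nxt
    return prev

assert rotate_via_ibfly(8, 3, 'abcdefgh') == 'fghabcde'
```

The inner assertions confirm that each stage transition is achievable purely with pass/swap decisions on pairs 2^(j−1) positions apart, which is exactly what the stage hardware provides.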
An n-bit inverse butterfly circuit can be viewed as two (lg(n) − 1)-stage circuits fol-
lowed by a stage that swaps or passes through paired bits that are n/2 positions apart. To
right rotate the input in_{n−1} … in_0 by s positions, the two (lg(n) − 1)-stage circuits must
have right rotated their half inputs by s′ = s mod n/2 and the input to stage lg(n) must be
standard cell implementation. Compared to the standard cell implementation of Table 4.6,
the FPGA unit that supports variable pex.v and pdep.v is faster and larger relative to
the log shifter, perhaps due to how the addition in the parallel prefix population counter
is synthesized. We note that the ALU is faster and smaller than all the shifter designs,
possibly due to specialized adder cells used in the ALU implementation.
From both standard cell and FPGA implementations we see that supporting variable pex.v
and pdep.v operations has a high cost, similar to the case of the standalone advanced bit
manipulation units in Section 3.3. One may choose to simply pay this cost in order to support
the greatest range of operations. Alternatively, one may simply omit support for variable
operations and use the functional unit with only bfly, ibfly, static pex and pdep,
thereby forcing the use of a software routine to generate the control bits whenever variable
pex.v and pdep.v are performed. As we will see in Chapter 6, this may be a sufficient
alternative.
We remark that full custom designs of the ALU, log shifter and our new shift-permute
unit should be done, since standard cell or FPGA implementations may not reflect a fair
or accurate comparison – especially between the shifters and the ALU, which is typically
highly optimized by custom circuit design. Such circuit design is more appropriately done
by microprocessor custom circuit designers according to implementation-specific needs
and the process technology used.
4.5 Summary
The advanced bit manipulation functional unit can be enhanced to support the operations
usually performed by the standard shifter. We showed that standard shifter operations are
all variations on rotations and then designed a rotation control bit generator circuit, which
produces the control bits for the inverse butterfly (or butterfly) datapath based on the shift
amount. This new design is an evolution of shifter architectures away from the classic
barrel and log shifters.
As a new basis for shifters performing only existing shift, rotate, extract, deposit and
mix instructions, our new ibfly-based design has 1.01× the latency and 0.87× the area of
the log-based shifter. Our new unit that also supports static pex and pdep operations has
1.11× the latency and 1.83× the area of the log shifter (standard cell implementation), but
is much more capable.
In the following chapter, we consider the bit matrix multiply operation which is used
to accelerate parity computation and can also perform a superset of the bit manipulation
operations. In Chapter 6 we analyze the applications that benefit from advanced bit manip-
ulation instructions.
Chapter 5
Bit Matrix Multiply
Another powerful bit manipulation operation is bit matrix multiply (bmm). bmm, which
multiplies two bit matrices contained in processor registers, performs a superset of the
advanced bit manipulation operations discussed in Chapter 3. Furthermore, bmm has many
additional capabilities such as finite field multiplication and parity computation. In this
chapter we discuss how bmm, which has heretofore been supported only by supercomputers
such as Cray [82], can be implemented in a commodity microprocessor.
This chapter is organized as follows. Section 5.1 gives the definition of the bit matrix
multiply operation. Section 5.2 discusses the basic operations that bmm can be used to
perform. Section 5.3 describes methods of computing bmm using existing instruction set
architectures. Section 5.4 proposes new architectural enhancements to accelerate bmm.
Section 5.5 compares the new proposals to the existing techniques. Section 5.6 discusses
related work in bmm implementation. Section 5.7 summarizes this chapter.
5.1 Definition of Bit Matrix Multiply Operation
The bit matrix multiply operation, bmm.n, multiplies two (n × n) bit matrices to produce
a third (n × n) bit matrix. The operation is defined in the first line of Table 5.1. The rows
Table 5.1: Definitions of Bit Matrix Multiply Instructions

bmm.n C = A, B (Bit Matrix Multiply)
    A, B, C: n × n bit matrices; C = A × B mod 2
    for i from 1 to n
      for j from 1 to n
        c_{i,j} = a_{i,1}b_{1,j} ⊕ a_{i,2}b_{2,j} ⊕ . . . ⊕ a_{i,n}b_{n,j}

bmmt.n C = A, B (Bit Matrix Multiply with implicit transpose)
    A, B, C: n × n bit matrices; C = A × B^T mod 2
    for i from 1 to n
      for j from 1 to n
        c_{i,j} = a_{i,1}b_{j,1} ⊕ a_{i,2}b_{j,2} ⊕ . . . ⊕ a_{i,n}b_{j,n}
of a matrix are numbered ascending from top to bottom, starting at 1. Within a row we
follow the normal matrix numbering convention starting at 1 on the left. (When referring
to bit numbering within a register, position n − 1 is on the left and 0 is on the right.) An
alternative formulation which transposes the B matrix (bmmt.n) is given in the second
line of Table 5.1. Note that the transpose is implicit – the instruction input is B, but the
multiplication is by BT .
The first version of bmm most closely resembles standard matrix multiplication and
therefore is most useful when performing finite field arithmetic. The transpose version
is most useful when we are taking the dot products of the rows of A and the rows of B.
Given the general correspondence between vector dot product and matrix multiplication
(a • b = a × bT for row vectors a and b), this version naturally follows. The transpose
version can also be used when performing finite field arithmetic if the data happen to be in
the transpose form. The transpose version of bmm is the version implemented by Cray.
In particular, we mention two cases for n, n = 8 and n = 64. The bmm.8 instruction
multiplies two 8 × 8 matrices contained in general purpose registers (see Figure 5.1). The
matrices are laid out such that row i corresponds to byte 8 − i in little-endian order. The
bmm.64 instruction multiplies two 64 × 64 matrices and allows for operations on whole
processor words at once. The Cray instruction is equivalent to bmmt.64.
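A functional model of bmm.8 clarifies the register layout. In this Python sketch (our encoding of the layout described above: row 1 is the most significant byte, column 1 the leftmost bit of its byte; the example operand values are ours), the matrices live in 64-bit integers:

```python
def to_matrix(r):
    """Unpack a 64-bit register into an 8x8 bit matrix; row i is byte
    8-i (row 1 = most significant byte), column j is bit 8-j of it."""
    return [[(r >> (8 * (8 - i) + (8 - j))) & 1 for j in range(1, 9)]
            for i in range(1, 9)]

def from_matrix(m):
    r = 0
    for i in range(1, 9):
        for j in range(1, 9):
            r |= m[i - 1][j - 1] << (8 * (8 - i) + (8 - j))
    return r

def bmm8(ra, rb):
    """C = A x B mod 2 on 8x8 bit matrices packed in registers:
    AND for the bit products, XOR for the mod-2 sums."""
    A, B = to_matrix(ra), to_matrix(rb)
    C = [[0] * 8 for _ in range(8)]
    for i in range(8):
        for j in range(8):
            acc = 0
            for k in range(8):
                acc ^= A[i][k] & B[k][j]
            C[i][j] = acc
    return from_matrix(C)

# The identity matrix has row i = 0x80 >> (i - 1); A x I = A.
I = int.from_bytes(bytes(0x80 >> k for k in range(8)), 'big')
x = 0x0123_4567_89AB_CDEF
assert bmm8(x, I) == x
```

Replacing I with any other permutation matrix permutes the columns of A, which is the "column exchange" use of bmm discussed below.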
Figure 5.1: (a) Layout of 8 × 8 matrix in a register rB; (b) bmm.8 operation (1 bit of output)
5.2 Bit Matrix Multiply Basic Operations
We describe some basic operations that a bmm operation can perform.
Finite Field Multiplication – Bit matrix multiply can be used to perform multiplication
over GF(2k).
Subset Parity – Bit matrix multiply, bmm or bmmt, can be used to calculate the parity
of an arbitrary subset of the bits of a row of A. The bit positions from which the parity is
computed is specified by setting the corresponding bit positions to “1” in a column of B (or
row of B for the transpose version). Usage of bmmt may be easier for parity as the masks
can be directly specified in processor registers.
Transpose – The transpose version, bmmt, can perform bit matrix transpose by setting
A to I, the identity matrix. By Table 5.1 second row, C then equals BT .
Permutations – bmm can be used to perform arbitrary bit-level permutations with rep-
etitions and zeroing. A B matrix with a single “1” in each row and column produces a
permutation of the columns of the A matrix (“column exchange”). A B matrix with a sin-
gle “1” in each column, but more than one “1” in at least one row produces a permutation
with repetition. A matrix with a zero column will zero the corresponding column of the
result. Thus bmm can be used to perform any of the operations described in Chapter 3
(see Figure 5.2 for an example of bmm being used to form pex). bmm can also perform a
standard bit-wise and (with “1”s only appearing on the main diagonal), shift (with “1”s
only appearing on any single non-main diagonal) or rotate (with “1”s appearing on any
single wrapped-around diagonal). If, instead of B, the A matrix is a permutation matrix,
then the rows (or subwords) of B are permuted (“row exchange”).
In general, for an n-bit LFSR, the A matrix is given by:
A = \begin{bmatrix} 0 & \cdots & 0 & p_1 \\ & & & p_2 \\ & I_{n-1} & & \vdots \\ & & & p_n \end{bmatrix}
where the p vector holds the coefficients of the generator polynomial and is used to compute
the feedback, and the (n − 1)-bit identity matrix shifts the rest of the state. Note that to
compute up to k cycles of the LFSR at once, we multiply the state by A^k.
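The companion-matrix formulation can be checked against a step-by-step LFSR. This Python sketch (our small example, using a 4-bit register with hypothetical tap coefficients) builds A as above and verifies that multiplying the state column vector by A^k mod 2 advances the register k cycles:

```python
n = 4
p = [1, 0, 0, 1]     # hypothetical tap coefficients p_1 .. p_n

# Companion matrix: first row (0 ... 0 p_1); below it, I_{n-1} with
# the column (p_2 ... p_n) appended on the right.
A = [[0] * (n - 1) + [p[0]]]
for i in range(1, n):
    row = [0] * (n - 1)
    row[i - 1] = 1
    A.append(row + [p[i]])

def mat_mul(X, Y):
    """GF(2) matrix product: AND for multiply, parity for add."""
    return [[sum(X[i][k] & Y[k][j] for k in range(len(Y))) & 1
             for j in range(len(Y[0]))] for i in range(len(X))]

def mat_pow(M, k):
    R = [[int(i == j) for j in range(len(M))] for i in range(len(M))]
    for _ in range(k):
        R = mat_mul(R, M)
    return R

def lfsr_step(s):
    """One cycle computed directly: feedback from s_n into the taps."""
    fb = s[-1]
    return [p[0] & fb] + [s[i - 1] ^ (p[i] & fb) for i in range(1, n)]

s0, k = [1, 0, 1, 1], 5
direct = s0
for _ in range(k):
    direct = lfsr_step(direct)
via_matrix = [row[0] for row in mat_mul(mat_pow(A, k), [[b] for b in s0])]
assert via_matrix == direct
```

With bmm, the A^k multiply becomes a single instruction once A^k has been precomputed, so k cycles of the LFSR collapse into one operation.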
Error Correction – Error correction coding adds redundancy to the message so that the
symbols can be recovered in the face of channel noise. One basic error correction scheme
is linear block codes, in which parity bits are computed for various subsets of the bits of
a message block. The generation of multiple parity patterns can be obtained by using a
bmm of the message symbols with a generator matrix. In convolutional coding, the input
is processed as a continuous stream. One or more shift registers store some number of
previous input bits and the outputs are the parity of subsets of the stored bits and the next
input bits. Convolutional codes can also be sped up by bmm by expressing the parity generation
in matrix form and then replicating and shifting that matrix to achieve the effect of the shift
registers.
Puncturing – In puncturing, message bits are removed after error correction coding to remove
some redundancy. Removing message bits is accomplished by a pex instruction. Depunc-
turing is performed at the decoder to reinsert the bits and uses a pdep instruction.
6.2 Performance Results
We coded kernels for the above applications and simulated them using the SimpleScalar
Alpha simulator [100], enhanced to recognize our new instructions by replacing unused op-
codes with our new instruction definitions. The simulator parameters are given in Table 6.2.
Figure 6.7 shows speedups for pex and pdep, and Figures 6.8 and 6.9 show results for
bmm. In each case, the results are normalized with respect to the baseline Alpha ISA in-
struction counts.
For pex and pdep we tested bit compression of a long stream of bytes; LSB steganog-
raphy coding replacing the low 4 bits of 16-bit PCM encoded audio; uuencoding using a
long stream of bytes; integer compression using a long stream of integers; binary image
morphology using a 512× 512 bit image; BLASTX translate using a long stream of DNA
bases; and random number generation using a long stream of bits from the entropy pool.
The baseline implementation used the algorithms from Hacker’s Delight [16].
Overall, the speedups over baseline range from 1.07× to 9.9×, with an average of
2.19× (1.84× for static pex and 1.93× for static pdep). The benchmark that exhibited
the greatest speedup was random number generation as performing pex.v in software is
much more expensive than performing static pex in software. Of the other benchmarks,
integer compression was sped up the least. This is due to the extra bookkeeping required
to track when compressed integers cross word boundaries. The speedup is also lower for
binary image morphology, due to extra work in extracting the 3 × 3 pixel neighborhoods
that cross register boundaries, and for steganography, as there are only 4 fields per word,
limiting the effect of the new instructions.

Table 6.2: SimpleScalar Parameters

    instruction fetch queue    32
    decode width               4
    issue width                4
    RUU size                   128
    load/store queue           32
    memory ports               2
    ALUs                       4
    multipliers                2
    pex/pdep units             1
    bmm units                  2*
    ptlu units                 1
    static pex/pdep latency    1
    pex.v/pdep.v latency       3
    bmm latency                1
    ptrd latency               2
    L1 I-cache                 32KB
    L1 D-cache                 32KB
    L2 cache                   500KB

    * Only one set of extra registers, but two AND-XOR trees
We also tested the BLASTZ program using the simulator. The results were highly
dependent on the inputs chosen. For shorter sequences, in the range of 104 bases, speedups
up to 1-2% were observed. For longer sequences, 105 bases, only a minor speedup was
observed – around 0.1%. Clearly, BLASTZ can benefit from pex, but this benefit can be
very minor in some cases.
For bmm, we first ran a set of artificial benchmarks. We tested a generic (N × 64) ×
(64 × 64) → (N × 64) matrix multiply for N = 64, 1024 and 65536 and multiplication
of two 1024 × 1024 matrices (one case where the B matrix is known in advance and one
where it is not). The baseline algorithm is serial table lookup. The results are shown in
Figure 6.8.
Figure 6.7: Speedup of applications using pex and pdep
Figure 6.8: Speedup of artificial benchmarks using bmm
The results for the (N × 64)× (64× 64)→ (N × 64) basic test case are similar to the
results of Table 5.7 only for N = 1024. For N = 64, the overhead of loading the B matrix
for bmm.64 and loading the tables for PTLU adversely affects performance, as indicated
in Table 5.8, while the overhead is negligible for the other cases. For N = 65536, cache
misses become significant. bmm.64 and PTLU are the most affected because they require
the fewest computation operations, thus making them most sensitive to memory latency.
For matrix multiplication using precomputed matrices (indicated by a * in Figure 6.8),
again we see the effect of the overhead for bmm.64 and PTLU. PTLU is also negatively
impacted by cache misses due to the need to store all the tables (4MB of tables). For matrix
multiplication using a new B matrix (indicated by a ** in Figure 6.8), the baseline table
lookup case becomes worse due to the need to compute the tables. This overhead is greater
than that of resizing the matrices. Consequently, bmm.8, bmm.16.ac and bmm.64 are
better relative to the baseline.
We then evaluated the use of bmm in the applications. We tested the linear cryptanalysis
algorithm of Figure 2.19 (inner loop only); random number generation using a 768 × 256
Toeplitz matrix; block coding of a long stream of bits using the Golay code; convolutional
coding using the encoder of Figure 2.23; and transpose of a 1024×1024 matrix. The results
are shown in Figure 6.9.
For linear cryptanalysis, the results are affected by the extra work required to maintain
the counts for each mask. This limits the maximum speedup attainable (à la Amdahl's
Law). As bmmt.64 speeds up the parity computation portion the most, it most clearly
exhibits this limitation, even though N is large enough in this case to amortize the costs
of changing B. Additionally, the transpose version of bit matrix multiply is used, as we
assume the masks, plaintexts and ciphertexts are all in rows. This impacts PTLU, as the
transpose of the matrices is computed in software.
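The parity computation that bmmt.64 accelerates is, in scalar form, the parity of a masked word, evaluated once per mask per text pair. The following sketch of the scalar inner loop is our own illustration (names and loop structure are ours, not the benchmark code):

```c
#include <stdint.h>

/* Parity of (x AND mask): the core operation of the linear
   cryptanalysis inner loop. Returns 0 or 1. */
static int masked_parity(uint64_t x, uint64_t mask) {
    return __builtin_parityll(x & mask);
}

/* Scalar inner loop for one mask pair: bump the counter when the
   input and output parities agree for a (plaintext, ciphertext) pair. */
static void tally(const uint64_t *pt, const uint64_t *ct, int n,
                  uint64_t in_mask, uint64_t out_mask, long *count) {
    for (int i = 0; i < n; i++)
        if (masked_parity(pt[i], in_mask) == masked_parity(ct[i], out_mask))
            (*count)++;
}
```

A bit matrix multiply computes many such mask parities in one instruction, which is why the remaining per-mask bookkeeping becomes the bottleneck.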
Overhead of changing matrices also impacts the results for the random number gen-
eration benchmark. As N = 768, bmm.64 has worse performance as compared to the
Figure 6.9: Speedup of applications using bmm
1024 × 64 multiplication or the 1024 × 1024 multiplication. For PTLU, performance is
worse than the 1024 × 64 case, due to the overhead, but better than 1024 × 1024 case as
there are fewer tables and no cache misses.
For Golay encoding, bmm.64 and PTLU slow down, not due to the overhead of chang-
ing the B matrix or tables, but rather due to the fact that the input stream must be rearranged
to be multiplied by the fixed 24× 48 generator matrix (the 12× 24 generator matrix is tiled
twice). bmm.8 gets faster in this case as the generator matrix decomposes well into the 8×8
submatrices. bmm.16.ac is slower than bmm.8, due to the sparseness of the generator
matrix (expressed as a 48 × 96 matrix to better fit with the 16 × 16 submatrices). There
are 10 multiplies for both bmm.8 and bmm.16.ac and it is not possible to fully amortize
the cost of loading the B submatrix for each bmm.16.ac.
Compared to the Golay encoder, the results for the convolutional encoder are better
for all instructions. This is due to the structure of the B matrix – there are fewer multiply
instructions required, but more table lookups, as the table lookup method requires an entire
set of 8 rows to be zero to avoid a lookup, while multiplication requires a small 8 × 8
or 16 × 16 block to be zero to skip a submatrix multiply. The structure of the matrix for
convolutional encoding decomposes better for bmm.16.ac, so it is faster than bmm.8.
Table 6.3: Incremental performance gain and area increase for submatrix multiply

                           Incremental Increase   Incremental Increase in Area
                           in Performance         (versus 2× bmm.8)
  bmm.16.ac vs. bmm.8              1.4                      2.2
  bmm.64 vs. bmm.8                 3.2                     12.3
  bmm.64 vs. bmm.16.ac             2.3                      5.7
Additionally, the Golay encoder is a rate 1/2 encoder while the convolutional coder is a
rate 2/3. Thus, as there are fewer outputs per input for the convolutional encoder, more
input rows can be processed at a time without encountering register pressure, allowing for
greater amortization of loading the B matrix.
For transposition, bmmt.64 suffers as the B matrix is loaded every time, while few mul-
tiplications are required for bmmt.8 and bmmt.16.ac as A (= I) is very sparse.
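What bmmt with A = I computes is simply the transpose of B. A bit-by-bit reference version in C (our own sketch, far slower than the single-pass bmmt.64 form, but it makes the bit movement explicit):

```c
#include <stdint.h>

/* Reference 64x64 bit-matrix transpose: bit j of row i of the input
   becomes bit i of row j of the output. This is the effect of bmmt
   with A = I; the software loop touches every bit individually. */
static void bit_transpose64(const uint64_t in[64], uint64_t out[64]) {
    for (int j = 0; j < 64; j++) out[j] = 0;
    for (int i = 0; i < 64; i++)
        for (int j = 0; j < 64; j++)
            out[j] |= ((in[i] >> j) & 1ULL) << i;
}
```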
Overall, bmm.8 is 1.6× faster than the baseline case, bmm.16.ac is 2.2× faster,
bmm.64 is 5.0× faster and PTLU is 3.6× faster. However, we see that bmm.8 and
bmm.16.ac are mostly insensitive to extremes or changes in parameters. bmm.64 and
PTLU are, on the other hand, highly sensitive due to the need to amortize the cost of loading the
B matrix or tables. Furthermore, the cost of computing tables is significant. We also see
sensitivity to awkward matrix sizes and sparse matrices. bmm.8 is most able to handle the
former, as long as the matrix can be decomposed into 8 × 8 submatrices, and the latter, as
few submatrices are required.
In Table 6.3, we consider the incremental performance gain and area increase going
from bmm.8 to bmm.16.ac to bmm.64 (for units that support two multiplies). Going
from bmm.8 to bmm.16.ac to bmm.64 does provide increasing performance, but at a
steeper and steeper cost. Clearly, bmm.64 has the best overall performance, but it also
has the greatest cost. The bmm.8 instruction provides reasonable performance gains, is
insensitive to parameter changes, can take advantage of sparsity, requires no extra state and
can, by itself, perform many of the existing SIMD ISA subword manipulation instructions.
6.3 Summary
In this chapter, we analyzed applications that benefit from the newly proposed instruc-
tions. These applications include bit compression and decompression, least significant bit
steganography, binary image morphology, transfer coding, bioinformatics, integer com-
pression, random number generation, error correcting coding, matrix transposition and
cryptology.
Our results show that the applications that benefit from pex and pdep were sped up
2.19× on average. However, only one of our benchmarks made use of loop-invariant pex
and pdep, one made use of the variable pex.v instruction and none used pdep.v.
Consequently, it may not be necessary to provide hardware support for the loop-invariant
and variable operations, especially as the hardware decoder has high cost.
The benchmarks for bit matrix multiply show that bmm.8 is 1.6× faster than the base-
line case, bmm.16.ac is 2.2× faster, bmm.64 is 5.0× faster and PTLU is 3.6× faster,
on average. bmm.64 and PTLU have the best overall performance, but they also have the
greatest cost.
Chapter 7
Conclusions and Future Research
This thesis presented a number of contributions concerning the design, implementation and
benefits of advanced bit manipulation instructions for commodity processors. In this chap-
ter, we summarize the work done in the thesis and propose directions for future research.
7.1 Advanced Bit Manipulation Instructions
The first contribution of the thesis is the novel parallel extract and parallel deposit instruc-
tions. In our analysis of bit-oriented instructions, we saw that specialized bit permutation
instructions were lacking. We proposed the parallel extract instruction to perform bit gather
operations and the parallel deposit instruction to perform bit scatter operations. We showed
how these instructions utilize the butterfly and inverse butterfly datapaths and developed an
algorithm for decoding the bitmask that specifies the bit gather or scatter operations into
the controls for the datapaths. We designed a hardware circuit, composed of a parallel pre-
fix population count subcircuit and a set of left rotate and complement subcircuits, which
implements our decoding algorithm.
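For reference, the bit gather and bit scatter semantics that this decoder supports can be modeled by a bit-serial software loop. The sketch below captures the functional behavior only, not the butterfly-network implementation, and the function names are ours:

```c
#include <stdint.h>

/* Bit-serial reference for parallel extract (bit gather): collect the
   data bits of r2 selected by the mask in r3, right-justified. */
static uint64_t pex_ref(uint64_t r2, uint64_t r3) {
    uint64_t res = 0, bit = 1;
    for (; r3 != 0; r3 &= r3 - 1) {   /* walk set mask bits, LSB first */
        uint64_t m = r3 & -r3;        /* lowest set bit of the mask */
        if (r2 & m) res |= bit;
        bit <<= 1;
    }
    return res;
}

/* Bit-serial reference for parallel deposit (bit scatter): scatter the
   right-justified bits of r2 to the positions selected by the mask in r3. */
static uint64_t pdep_ref(uint64_t r2, uint64_t r3) {
    uint64_t res = 0;
    for (; r3 != 0; r3 &= r3 - 1, r2 >>= 1) {
        uint64_t m = r3 & -r3;
        if (r2 & 1) res |= m;
    }
    return res;
}
```

Note that pdep with a given mask inverts pex with the same mask on the masked bits, which is why the two instructions share one decoder.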
7.2 Advanced Bit Manipulation Functional Unit
We designed a standalone advanced bit manipulation functional unit that implements paral-
lel extract, parallel deposit and the butterfly and inverse butterfly permutation instructions.
Due to the high cost of the hardware decoder, we considered the design of two alternative units
– one that supports only static pex and pdep operations, for which the datapath control bits
are precomputed, and one that supports the variable pex.v and pdep.v instructions as well
as loop invariant operations, for which the hardware decoder is used once and the control
bits are then captured and reused. The unit that supports only the static instructions is faster
(0.95× the latency) and smaller (0.9× the area) than an ALU.
7.3 Advanced Bit Manipulation Functional Unit as Basis
for the Shifter
We proposed that the advanced bit manipulation functional unit, rather than being stan-
dalone, should replace the standard shifter. We showed how rotation operations are per-
formed on butterfly and inverse butterfly datapaths and how instructions such as shift, ex-
tract, deposit and mix are just minor variations on rotation. We designed a rotation control
bit generator circuit that produces the datapath control bits from the shift amount.
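The sense in which shift and extract are minor variations on rotation can be sketched in C: each is a rotation followed by masking off the wrapped-around bits. This is a functional model of the relationship, not the control-bit generator circuit itself; the helper names are ours:

```c
#include <stdint.h>

/* Rotate right by n (0 <= n <= 63). */
static uint64_t rotr64(uint64_t x, unsigned n) {
    n &= 63;
    return n ? (x >> n) | (x << (64 - n)) : x;
}

/* Logical shift right = rotate right, then clear the n wrapped-in high bits. */
static uint64_t shr64(uint64_t x, unsigned n) {
    return rotr64(x, n) & (~0ULL >> n);   /* keep the 64-n low bits */
}

/* Unsigned extract: rotate the field down to bit 0, then mask to len bits. */
static uint64_t extr64(uint64_t x, unsigned pos, unsigned len) {
    uint64_t mask = (len >= 64) ? ~0ULL : (1ULL << len) - 1;
    return rotr64(x, pos) & mask;
}
```

In the proposed shifter, the rotation is performed on the butterfly or inverse butterfly datapath and the masking is folded into the final stage, so each of these variants costs essentially one pass through the network.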
This new shifter architecture represents an evolution of shifter design from the classic
barrel and log shifters. The new shifter supporting static pex and pdep is 11% slower
and 1.8× larger than the log shifter, while being able to perform a much richer array of
operations. The shifter supporting variable pex.v and pdep.v is 2.1× slower and 2.4×
larger, but supports the widest range of operations.
7.4 Implementation of Bit Matrix Multiply in Commodity
Processors
Another contribution of the thesis consists of architectural proposals to accelerate the bit
matrix multiply operation. Bit matrix multiplication is used to speed up parity computation
and can perform a superset of the bit manipulation operations. This instruction is supported
by the Cray supercomputer and we considered alternatives suitable for a commodity pro-
cessor. One proposal is submatrix multiplication, in which the multiplicand matrix B is
stored in extra registers in the functional unit. We also considered using the Parallel Ta-
ble Lookup module, which allows for looking up multiple tables in parallel, for bit matrix
multiply, as the best current technique for bit matrix multiply is the table lookup method.
For submatrix multiplication, we investigated the choice of sizes for multiplier and
multiplicand matrices and determined that bmm.8, which multiplies two 8×8 matrices (in
general purpose registers), and bmm.16.ac, which multiplies a 4× 16 matrix in a general
purpose register by a 16×16 matrix in the extra storage and then accumulates with a 4×16
matrix in a general purpose register, were the best choices. The bmm.8 unit is 30% the
size of an ALU, and thus one may consider implementing bmm.8 functionality within the
ALU. The bmm.8 unit also requires no additional architectural register state and thus does
not need operating system support.
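Assuming one row per byte with row 0 in the least significant byte (our packing convention for illustration; the actual bmm.8 layout may differ), the effect of multiplying two 8×8 bit matrices held in general purpose registers can be modeled as:

```c
#include <stdint.h>

/* 8x8 bit-matrix multiply over GF(2): for each row of A, XOR together
   the rows of B selected by that row's set bits. Both operands and the
   result are packed one row per byte of a 64-bit word. */
static uint64_t bmm8(uint64_t a, uint64_t b) {
    uint64_t c = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t arow = (a >> (8*i)) & 0xff, crow = 0;
        for (int j = 0; j < 8; j++)
            if (arow & (1u << j))
                crow ^= (b >> (8*j)) & 0xff;
        c |= (uint64_t)crow << (8*i);
    }
    return c;
}
```

With this packing, the 8×8 identity matrix is the word 0x8040201008040201, and multiplying by it on either side leaves the operand unchanged.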
7.5 Application Analysis
The final contribution of this thesis is the analysis of applications that benefit from bit
manipulation operations. We showed how diverse applications such as bit compression,
image manipulation, communication coding, random number generation, bioinformatics,
integer compression and cryptology are all sped up by our new instructions. Overall, the
applications that benefited from pex and pdep were sped up 2.19× on average. Of the
applications considered, only one uses pex.v. This, together with the high cost of the pex.v
functional unit, indicates that the preferred functional unit is one that supports only static
pex and pdep.
The applications that benefit from bit matrix multiply were sped up 1.6× using bmm.8,
2.2× using bmm.16.ac, 5.0× using bmm.64, and 3.6× using PTLU, on average. Over-
all, bmm.64 was 3.2× faster than bmm.8 and bmm.16.ac was 1.4× faster. However,
the bmm.16.ac unit is 2.2× the size of bmm.8 and the bmm.64 unit is 12.3× the size.
Clearly, going from bmm.8 to bmm.16.ac to bmm.64 does provide increasing perfor-
mance, but at a steeper and steeper cost. To achieve the best performance, a full Cray-like
bmm.64 unit can be implemented. The PTLU unit has good performance, but is very
costly. However, it can be used for many purposes aside from bit matrix multiply.
7.6 Future Research
There are a number of possible directions for future work. The first possibility is compiler
support for bit manipulation instructions. In coding our simulations, we used compiler
intrinsics, essentially inline assembly with better syntax, to access our instructions. This
approach helps the programmer who is diligent enough to discover that the compiler sup-
ports the intrinsics. Ideally, the compiler will recognize the code sequence that is equivalent
to the instructions and automatically generate code that uses the instructions. This is likely
a difficult task, as there are many ways that these operations can be coded in software.
Other future work includes refinement of the implementation of the advanced bit manip-
ulation instructions. There may be better designs for the parallel prefix population counter
or, in fact, better ways entirely for implementing pex and pdep. Additionally, the designs
were synthesized primarily focused on minimizing latency. Other designs that consider
area and power may prove interesting to examine.
Application analysis is also likely a fruitful area for further work. We do not claim to
have performed an exhaustive listing of all the applications that can benefit from bit manip-
ulation instructions. There are likely other algorithms that can be sped up by our instruc-
tions. Furthermore, designing algorithms that specifically take into account the existence of
advanced bit manipulation instructions is an area that has seen little research. Conversely,
application analysis might yield other candidates for new specialized advanced permutation
instructions.
Bit matrix multiply in a multicore setting is another avenue for future research. This
area includes considering how best to decompose large matrix multiplies for parallel pro-
cessing. It is assumed that prior work in decomposing standard matrix multiplication has
relevance, but the unique overheads involved with setting up the B matrices need to be
considered. Another point to evaluate is the tradeoff between having many cores each con-
taining a small bit matrix multiply unit (such as bmm.8), versus having only some cores
contain a larger unit (such as bmm.64).
Overall, we have brought the acceleration of advanced bit manipulation operations out
of the realm of “programming tricks.” We have shown that these operations are needed by
many applications and that direct support for them can be implemented in a commodity