Hierarchical Memory with Block Transfer
Alok Aggarwal    Ashok K. Chandra    Marc Snir
IBM Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York 10598
Abstract

In this paper we introduce a model of Hierarchical Memory with Block Transfer (BT for short). It is like a random access machine, except that access to location x takes time f(x), and a block of consecutive locations can be copied from memory to memory, taking one unit of time per element after the initial access time.

We first study the model with f(x) = x^α for 0 < α < 1. A tight bound of Θ(n log log n) is shown for many simple problems: reading each input, dot product, shuffle exchange, and merging two sorted lists. The same bound holds for transposing a √n × √n matrix; we use this to compute an FFT graph in optimal Θ(n log n) time. An optimal Θ(n log n) sorting algorithm is also shown. Some additional issues considered are: maintaining data structures such as dictionaries, DAG simulation, and connections with PRAMs.
Next we study the model f(x) = x. Using techniques similar to those developed for the previous model, we show tight bounds of Θ(n log n) for the simple problems mentioned above, and provide a new technique that yields optimal lower bounds of Ω(n log^2 n) for sorting, computing an FFT graph, and for matrix transposition. We also obtain optimal bounds for the model f(x) = x^α with α > 1.

Finally, we study the model f(x) = log x and obtain optimal bounds of Θ(n log* n) for the simple problems mentioned above and of Θ(n log n) for sorting, computing an FFT graph, and for some permutations.
1. INTRODUCTION
1.1 Background
Large computers usually have a complex memory hierarchy consisting of a small amount of fast memory (registers) followed by increasingly larger amounts of slower memory, which may include one or two levels of cache, main memory, extended store, drums, disks, and mass store. Efficient execution of algorithms in such an environment requires some care in making sure the data are available in fast memory most of the time when they are needed. Compilers, machine architectures, and operating systems attempt to help by doing register allocation, cache management, or demand paging, but ultimately the algorithm designer can influence …
some r; these are the items with the same m − k most significant bits. Similarly, a k-output-block consists of 2^k items that occupy locations r2^k + 1, …, (r + 1)2^k when the permutation has been computed. The proof of Theorem 3.1 relies on the following observations:
• A rational permutation τ_σ where σ(i) = i for i < ⌈αm⌉ (i.e., the corresponding bit permutation involves permuting only the most significant m − ⌈αm⌉ bits) can be achieved in O(n) time. This is because such a permutation preserves ⌈αm⌉-input-blocks; it can be achieved by moving each ⌈αm⌉-input-block (of size approximately n^α). The cost of each move is O(n^α), i.e., constant time per element.
• Any rational permutation τ_σ where σ(i) = i unless i ∈ {k − 1, k − 2, …, k − ⌊0.25(1 − α)m⌋}, for any k ≤ 0.25m, can be achieved in O(n) time. This follows because we can move each ⌈αm⌉-input-block from the first O(n) memory locations to the first O(n^α) memory locations at cost O(n^α) per block move; we can move each ⌈α^2 m⌉-input-block from the first O(n^α) memory locations to the first O(n^{α^2}) memory locations at cost n^{α^2} per block move, and so on. Each move takes constant time per element. In 2/log(1/α) iterations, k-input-blocks are moved to the first O(2^k) memory locations. We can then permute the data within these blocks by applying the previous observation.
• Any rational permutation τ_σ where σ(i) = i unless i ≥ m/4 can be achieved in O(n) time; this is accomplished as follows. Partition the most significant ⌈0.75m⌉ bits into intervals of size ⌊0.125(1 − α)m⌋ each and denote these intervals by a_p, a_{p−1}, …, a_1, where p = 6/(1 − α). It is not hard to see that any permutation on the set {m − 1, m − 2, …, ⌈0.25m⌉} can be expressed as the product of O(p^2) permutations, where each of these permutations affects at most two consecutive intervals (a_i, a_{i−1}). Since each pair of intervals has at most 0.25(1 − α)m bits, the corresponding rational permutation on n elements can be achieved in O(n) time by using the previous two observations. Consequently, the rational permutation whose corresponding bit permutation permutes only the most significant 0.75 log n bits can be achieved in O(p^2 n) time, i.e., in O(n) time since p is a constant that depends only upon α.
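At the RAM level (ignoring the BT_α cost accounting entirely), a rational permutation is just a permutation of the bits of each index. A minimal sketch, where the convention that sigma[i] names the source bit of destination bit i is our assumption:

```python
def apply_bit_permutation(data, sigma):
    # Rational permutation tau_sigma at the RAM level: the element at
    # index x moves to the index y whose bit i equals bit sigma[i] of x.
    m = len(sigma)
    assert len(data) == 1 << m
    out = [None] * len(data)
    for x in range(len(data)):
        y = 0
        for i in range(m):
            y |= ((x >> sigma[i]) & 1) << i
        out[y] = data[x]
    return out

# Transposing a 4 x 4 matrix stored row-major is the bit permutation
# that swaps the two row bits with the two column bits:
transposed = apply_bit_permutation(list(range(16)), [2, 3, 0, 1])
```

The entire difficulty addressed by the observations above is achieving such a permutation with cheap block moves; at the RAM level it is trivial.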
Below, we use the divide-and-conquer technique and the previous observations to achieve a rational permutation on n elements. We assume that the input is in locations 2n + 1, 2n + 2, …, 3n and the output will be in 3n + 1, 3n + 2, …, 4n.

Let σ be a permutation on m − 1, …, 1. It is not hard to see that σ can be expressed as the product of three permutations σ_1 σ_2 σ_3, where σ_1 and σ_3 are permutations on the most significant ⌈0.75m⌉ bits and σ_2 is a permutation on the least significant ⌊0.5m⌋ bits. Now, τ_σ is also the product τ_{σ_1} τ_{σ_2} τ_{σ_3}. Furthermore, τ_{σ_1} and τ_{σ_3} can be computed in O(n) time using the above observations, and τ_{σ_2} is computed recursively by first moving appropriate k = 2^{⌊m/2⌋} contiguous inputs (i.e., ⌊m/2⌋-input-blocks) into locations 2k + 1, …, 3k and then applying τ_{σ_2} to them.

The running time of this algorithm fulfills the recursion T(n) = √n × T(√n) + O(n) where T(1) = constant. Thus, it can be verified that T(n) = O(n log log n).
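The recursion T(n) = √n · T(√n) + O(n) can be checked numerically: at n = 2^(2^k) it unrolls exactly k = log log n levels, each contributing Θ(n). A small sketch (the constants and base case are ours):

```python
import math

def T_rec(n, c=1.0):
    # T(n) = sqrt(n) * T(sqrt(n)) + c*n, bottoming out at T(2) = 1
    if n <= 2:
        return 1.0
    r = math.sqrt(n)
    return r * T_rec(r, c) + c * n

# At n = 2^(2^k) the recursion unrolls exactly k = log log n levels,
# each contributing c*n, so T(n)/(n * log log n) stays near c:
for k in (3, 4, 5):
    n = 2.0 ** (2 ** k)
    assert 1.0 < T_rec(n) / (n * k) < 1.25
```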
•

If n is any power of 4 then a √n × √n matrix can be transposed using a rational permutation. In fact, if p and q are both powers of two then a p × q matrix can be transposed using a suitable rational permutation; below, we consider the general case of transposing any p × q matrix.

Corollary 3.2: A p × q matrix can be transposed on a BT_α machine with 0 < α < 1 in time O(pq log log pq).
Proof: Let r = 2^⌈log p⌉, s = 2^⌈log q⌉, and let A be the given p × q matrix. Then, perform the following steps:

(a) Expand A to an r × s matrix B such that B(i,j) = A(i,j) for 1 ≤ i ≤ p, 1 ≤ j ≤ q, and B(i,j) is arbitrary otherwise.
(b) Transpose B to obtain B^T.
(c) Extract A^T from B^T.

Clearly, step (b) is a rational permutation and can be computed in O(rs log log rs) time using Theorem 3.1. Since the computation of step (c) is similar to that of step (a), we only show below how to compute step (a).

Initially, the p × q matrix A is in locations 2rq + 1, …, 2rq + pq. For any q_1 ≥ 1, to recursively expand a p × q_1 matrix A_1 that is stored in locations 2rq_1 + 1, …, 2rq_1 + pq_1 in row major order, let q_2 = max(1, ⌊(rq_1)^α / r⌋) and partition A_1 into ⌈q_1/q_2⌉ submatrices, each (except possibly the last submatrix) of size p × q_2; move every submatrix into locations 2rq_2 + 1, …, 2rq_2 + pq_2; recursively expand that submatrix, and copy the result into appropriate locations in 3rq_1 + 1, …, 4rq_1 to obtain the r × q_1 expansion of A_1 (the case q_1 = 1 and the last submatrix of A_1 are special cases that are easily handled). The running time is O(rs log log rs) since there are at most ⌈log log q / log(1/α)⌉ recursive levels, each of which takes a cumulative running time of O(rs).
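Steps (a)–(c) can be sketched at the RAM level as follows; the careful memory placement and the block-transfer cost accounting of the proof are omitted, and the helper name is ours:

```python
def transpose_via_padding(A, p, q):
    # Transpose a p x q matrix, stored row-major in the flat list A,
    # by padding to power-of-two dimensions r x s as in Corollary 3.2.
    r = 1 << (p - 1).bit_length()   # r = 2^ceil(log p)
    s = 1 << (q - 1).bit_length()   # s = 2^ceil(log q)
    # (a) expand A to an r x s matrix B, arbitrary (None) elsewhere
    B = [[A[i * q + j] if i < p and j < q else None for j in range(s)]
         for i in range(r)]
    # (b) transpose B (on the BT machine this is a rational permutation)
    BT = [[B[i][j] for i in range(r)] for j in range(s)]
    # (c) extract A^T from B^T, returned row-major as a q x p matrix
    return [BT[j][i] for j in range(q) for i in range(p)]
```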
•

Theorem 3.3: An n-point FFT graph can be computed in O(n log n) time on a BT_α machine for fixed 0 < α < 1.

Proof Outline: An n-point FFT graph can be computed by recursively computing 2√n FFT graphs on √n points each and two transpositions of matrices of size √n × √n. Now, we can use the transposition algorithm given in Corollary 3.2 that takes O(n log log n) time in the worst case, so, if T(n) denotes the time to compute this FFT graph, we have T(n) ≤ 2√n × T(√n) + O(n log log n) and T(1) = constant. This yields T(n) = O(n log n).
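This recursion can also be checked numerically: at n = 2^(2^k) the level at depth d contributes about 2^d · n · (k − d), and the weighted sum is Θ(n log n). A small sketch (constants and base case ours):

```python
import math

def T_fft(n, c=1.0):
    # T(n) = 2*sqrt(n)*T(sqrt(n)) + c*n*log2(log2(n)), T(2) = 1
    if n <= 2:
        return 1.0
    r = math.sqrt(n)
    return 2.0 * r * T_fft(r, c) + c * n * math.log2(math.log2(n))

# T(n)/(n log2 n) stays bounded by a constant, i.e. T(n) = Theta(n log n):
for k in (3, 4, 5):
    n = 2.0 ** (2 ** k)
    assert 1.5 < T_fft(n) / (n * 2 ** k) < 3.0
```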
•

We can sort a set S of n words in O(n log n log log n) time and O(n) memory using simple merge sort and the fact that merging can be done in O(n log log n) time on a BT_α machine (cf. Theorem 2.2 (iv)). In the following, we demonstrate an optimal O(n log n) time algorithm, Approx-Median-Sort(S), that takes O(n log log n) space.

Approx-Median-Sort(S) assumes that the elements of set S = {s_1, …, s_n} reside in locations 2n + 1, …, 3n, and it consists of the following five steps:
Step 1: Partition S into n^{1−α} subsets, S_1, …, S_{n^{1−α}}, of size n^α each. For 1 ≤ j ≤ n^{1−α}, execute the following substeps:

Substep 1.1: Bring the elements into locations indexed 2n^α + 1, …, 3n^α and sort S_j recursively by calling Approx-Median-Sort(S_j).

Substep 1.2: Form a set A containing the (i × log n)-th element of S_j for 1 ≤ i ≤ n^α / log n.
Step 2: After the completion of substep (1.2), A has n / log n elements. Furthermore, a simple analysis shows that the p-th smallest element in A is greater than at least p × log n − 1 elements in S, and this element is also greater than at most p × log n + (n^{1−α} − 1) × log n − 1 elements in S. Now, sort A by using merge sort.
Step 3: Form a set, B = {b_1, b_2, …, b_r}, of r = n^α / log n approximate-partitioning elements, such that for 1 ≤ l ≤ r, b_l equals the (l × n^{1−α})-th smallest element of A. At this point, note that because of the remark made in step (2), there are at least l × n^{1−α} × log n − 1 elements of S that are less than b_l, and also at most (l + 1) × n^{1−α} × log n − 1 elements of S that are less than b_l.
Step 4: Now, for all j, 1 ≤ j ≤ n^{1−α}, partition the elements of S_j into r + 1 subsets, S_{j,0}, S_{j,1}, …, S_{j,r}, such that for every x ∈ S_{j,l}, b_l ≤ x < b_{l+1} (treating b_0 as −∞ and b_{r+1} as +∞). Compute C_l = ∪_{j=1}^{n^{1−α}} S_{j,l} and let |C_l| = n_l. Then, from the remarks in step (3), note that 1 ≤ n_l ≤ 2 × n^{1−α} × log n.

Step 5: For 0 ≤ l ≤ n^α / log n, bring C_l to the faster memory and sort C_l by calling Approx-Median-Sort(C_l) recursively.
end
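A RAM-level sketch of the five steps follows; the memory layout, the block moves, and the generalized transposition of step (4) are omitted, and the base case, the guard for repeated keys, and all constants are our assumptions:

```python
import bisect
import math

def approx_median_sort(S, alpha=0.5):
    # RAM-level sketch of Approx-Median-Sort (cost accounting omitted).
    n = len(S)
    if n <= 16:                                # base case (ours)
        return sorted(S)
    na = max(2, round(n ** alpha))             # subset size, ~n^alpha
    lg = max(1, int(math.log2(n)))             # sampling gap, ~log n
    # Step 1: split into ~n^(1-alpha) subsets, sort each recursively.
    subsets = [approx_median_sort(S[i:i + na], alpha)
               for i in range(0, n, na)]
    # Substep 1.2 and step 2: sample every lg-th element into A; sort A.
    A = sorted(x for Sj in subsets for x in Sj[lg - 1::lg])
    # Step 3: every ~n^(1-alpha)-th element of A partitions S.
    n1a = max(1, round(n ** (1 - alpha)))
    B = A[n1a - 1::n1a]
    if not B:
        return sorted(S)
    # Step 4: partition S into buckets C_0, ..., C_r by the pivots in B.
    buckets = [[] for _ in range(len(B) + 1)]
    for x in S:
        buckets[bisect.bisect_right(B, x)].append(x)
    if any(len(C) == n for C in buckets):      # guard: no progress
        return sorted(S)                       # (e.g., all keys equal)
    # Step 5: sort each bucket recursively and concatenate.
    out = []
    for C in buckets:
        out.extend(approx_median_sort(C, alpha))
    return out
```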
Theorem 3.4: Algorithm Approx-Median-Sort(S) sorts n words in O(n log n) time on a BT_α machine for 0 < α < 1.
Proof Outline: The correctness of Approx-Median-Sort(S) is easy to establish and hence omitted. To analyze the running time, let T(n) denote the time taken by Approx-Median-Sort(S) to sort n elements. Then, substep (1.1) can be executed in total time n^{1−α} T(n^α) + O(n) for all S_j. Substep (1.2) can be performed by transposing an (n^α / log n) × n^{1−α} log n matrix in time O(n log log n). Merge sort for m = n / log n items in step (2) takes time O(m log m log log m) = O(n log log n). The set B is computed in step (3) from set A in time O((n log log n) / log n). Each set S_j can be brought into fast memory and merged with the set B in total time O(n^{1−α} × (n^α log log n^α)) = O(n log log n). At that point the sets S_{j,l} have been computed; they are stored in "row major order," i.e., the sets S_{j,0}, …, S_{j,r} are stored in contiguous memory locations, for each 1 ≤ j ≤ n^{1−α}. We need to bring together the sets S_{1,l}, …, S_{n^{1−α},l} for each 0 ≤ l ≤ r, i.e., store the matrix of sets S_{j,l} in "column major order." From Theorem 3.5, this permutation can be computed in O(n (log log n)^4) time and O(n log log n) space. Finally, step (5) can be executed in O(n log log n) + Σ T(n_l) time, where Σ n_l = n and n_l ≤ 2n^{1−α} log n. This implies that the running time of this algorithm obeys the following recurrence relation:
T(n) ≤ n^{1−α} T(n^α) + O(n (log log n)^4) + Σ_{l=0}^{n^α / log n} T(n_l),

where T(2) = constant, Σ_{l=0}^{n^α / log n} n_l = n, and n_l ≤ 2n^{1−α} log n. It can then be verified that T(n) = O(n log n).
•

Suppose the (i,j)-th entry of a p × q matrix, M, consists of a set R(i,j) of data elements such that |R(i,j)| ≥ 1, Σ_{i=1}^{p} Σ_{j=1}^{q} |R(i,j)| = n, the entries of the matrix are stored in row major order, and the data elements of each R(i,j) are stored in contiguous memory locations. Then, the problem of Generalized Matrix Transposition requires the entries R(i,j) to be stored in column-major order such that the data elements of each individual R(i,j) are still located in contiguous memory locations.
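At the RAM level the specification is simple; a sketch (names ours), with the whole difficulty on a BT machine being that the sizes |R(i,j)| vary:

```python
def generalized_transpose(R, p, q):
    # R is a p x q matrix (list of rows) whose (i,j) entry is a
    # nonempty list of data elements, conceptually stored row-major.
    # Return the flat column-major storage with each R(i,j) still
    # contiguous. The BT-machine time/space accounting is omitted.
    flat = []
    for j in range(q):                 # columns become the outer order
        for i in range(p):
            flat.extend(R[i][j])       # each R(i,j) stays contiguous
    return flat
```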
The sets C_l, for all values of l between 1 and r, in step (4) of Approx-Median-Sort(S) can be computed using a generalized matrix transposition for a matrix of size (n^α / log n) × n^{1−α}.

Theorem 3.5: The generalized matrix transposition problem can be solved in O(n (log log n)^4) time and O(n log log n) space on a BT_α machine, for any fixed α where 0 < α < 1.
Proof Outline: The proof of Theorem 3.5 is an extension of a simple O(n (log log n)^2) time algorithm for matrix transposition; this will be provided in the final version of the paper. Roughly, one extra factor of log log n arises in the algorithm for generalized matrix transposition because we need to determine the length of the blocks to be moved, and the other factor arises because of the more complicated storage allocation needed.
•

4. MORE RESULTS FOR THE BT_α MODEL (0 < α < 1)

4.1 Maintaining Dictionaries

As mentioned in the proof of Theorem 2.2, stacks and deques can be implemented in the BT_α model in amortized time O(log log n) per insertion or deletion operation. It is not hard to show that searching can be performed in optimal time O(n^α) using a perfectly balanced binary search tree that is stored in memory in breadth-first order. We now consider data structures that support all three operations.
A dictionary is a data structure that supports the following three operations: search(key), insert(key, value), and delete(key); we call the last two operations update operations. For simplicity, we assume that an insert with the key of an entry that is already present in the dictionary replaces that entry; a delete for a nonexisting key has no effect. We present an efficient dictionary structure for the BT_α model, where 0 < α < 1. This structure supports searching in time O(n^α), which is optimal; updates are done in amortized time O(log n log log n).
We may assume that a dictionary is built by a sequence of update operations, starting from an empty structure. The dictionary can be considered to consist of a sequence of entries, where each entry represents one update request. An update operation adds an entry to the head of this sequence; a search operation returns the most recent entry with a matching key value. The search fails if there is no such entry, or if the most recent such entry is a delete. The sequence of entries can be reordered and compressed: updates on different keys commute with one another, and older updates to a key can be deleted if there is a more recent update to the same key.
Theorem 4.1: A dictionary can be maintained in the BT_α model (0 < α < 1) so that the worst-case time to search an element in the dictionary is O(n^α) after n operations, and it takes O(log n log log n) amortized time to insert or delete an element in the dictionary.
Proof: Our construction is similar to the static-to-dynamic transformation method for data structures of Bentley and Saxe [BS80]. We store the sequence as a list T_1, …, T_k of trees, with T_1 containing updates that are more recent than those in T_2, and so on. T_i is a complete binary search tree, containing at most 2^i − 1 entries. There are no duplicate entries for the same key in a tree; however, there may be duplicate entries in distinct trees. After n operations, k = O(log n).
Tree T_i is stored between locations c2^i and (c + 1)2^i − 1 for an appropriate c ≥ 2, with interspersed "scratch" space from (c + 1)2^i to c2^{i+1} − 1. We will think of T_i as having 2^i − 1 nodes (by padding, if necessary). T_i is an i-level complete binary tree where the key for any node is larger than the keys for those in its left subtree and smaller than those in its right subtree. We partition T_i into subtrees, each of which is an αi-level complete binary tree T_{i,u} rooted at node u in T_i (for this exposition, we ignore the case where αi is not an integer). Each T_{i,u} is stored in contiguous locations in preorder (i.e., each T_{i,u} is sorted by keys), and T_{i,u} is stored before T_{i,v} if u is nearer to the root of T_i than v, or if they are at the same level and u is to the left of v. Thus, for a fixed i, the nodes in all subtrees T_{i,u} at the same level are contiguous and sorted; call this set of nodes a slice. The three operations are performed as follows:
Search - Search successively each T_i, until an entry with a matching key value is found.

Update - Let T_0 be a one-node tree, representing the update request. Starting with i = 0, recursively perform the following: merge T_i with T_{i+1}, and store the result in T_{i+1} (T_i remains empty); if T_{i+1} overflows, then recursively merge it with T_{i+2}.

Tree T_i can be searched in time O(2^{αi}) as follows. First make space in fast memory by copying the contents of locations 1, 2, …, 2^{αi} − 1 in time O(2^{αi}) into scratch storage starting at location (c + 1) × 2^i. Then, the first 2^{αi} − 1 locations in T_i (i.e., the root subtree of T_i) are copied into locations 1, …, 2^{αi} in time O(2^{αi}) and searched in O(αi × 2^{α^2 i}) time using αi probes, each taking O(2^{α^2 i}) time. This determines which tree T_{i,u} is to be searched next (unless the key has already been found), and the process is repeated at most ⌈1/α⌉ times, completing the search of T_i. Finally, we restore the original contents of locations 1, …, 2^{αi} − 1. Thus, the total time for searching all T_0, …, T_{log n} is O(n^α).
Now, consider a sequence of n updates being executed on a dictionary that contains n valid entries. We claim that two trees of size O(m) can be merged in time O(m log log m); the cost per entry of a merge is O(log log m). Every time T_i is merged with T_{i+1}, each entry in T_i is moved to T_{i+1}, and this may cause one entry to disappear (if it has the same key). It follows that an entry is merged at most k = O(log n) times. The total cost of all the merges per entry is O(log n log log n).
We conclude the analysis by proving our claim on merge time. To merge T_i and T_{i+1}, first sort T_i and T_{i+1}; next, merge the sorted lists; and finally, reconstitute T_{i+1} by essentially reversing the operations in the first step. We claim that each step takes O(m log log m) time for m = 2^i. Since this bound has been shown for merging, to prove this claim it suffices to show how to sort T_i in this time bound. But this follows easily since the ⌈1/α⌉ slices that partition T_i are already stored as sorted lists, and hence can be merged pairwise in time O(m log log m).
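At the RAM level, the static-to-dynamic skeleton of this construction (a list of sorted runs standing in for the trees T_i, newest first, merged bottom-up on update) can be sketched as follows; the BT_α layout of each tree into preorder subtrees and the exact size bound |T_i| ≤ 2^i − 1 are omitted, and all names are our assumptions:

```python
import bisect

class BTDictionary:
    # Sketch of Theorem 4.1's dictionary: runs[i] plays the role of
    # tree T_{i+1}, held as a sorted list of (key, value) pairs.
    _DELETED = object()

    def __init__(self):
        self.runs = []                     # newest run first

    def _put(self, key, value):
        new = [(key, value)]               # T_0: the update request
        i = 0
        while i < len(self.runs) and self.runs[i]:
            # merge the newer run into the next tree; newer entries
            # win on duplicate keys (older updates are compressed away)
            merged = dict(self.runs[i])
            merged.update(dict(new))
            new = sorted(merged.items())
            self.runs[i] = []
            i += 1
        if i == len(self.runs):
            self.runs.append([])
        self.runs[i] = new

    def insert(self, key, value):
        self._put(key, value)

    def delete(self, key):
        self._put(key, self._DELETED)      # a delete is just a newer entry

    def search(self, key):
        for run in self.runs:              # scan newest run first
            j = bisect.bisect_left(run, (key,))
            if j < len(run) and run[j][0] == key:
                v = run[j][1]
                return None if v is self._DELETED else v
        return None
```

The newest-run-first scan implements the rule that the most recent entry with a matching key decides the outcome of a search.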
•

It is interesting to note the similarity between the trees T_i given above and B-trees: both can be thought of as having a large fanout and keeping siblings contiguous. However, our structure is usually more efficient for updates in BT_α, but its efficiency drops when the number of valid dictionary entries becomes much less than the number of operations n.
Note that this algorithm does not determine the partition tree T from the graph G, nor the sets In(u), Out(u); only the times for the data movement and computation in G are counted. We call such an algorithm a nonuniform algorithm. Below, we provide two applications of the above algorithm that computes general directed acyclic graphs.
Corollary 4.2: If a straight-line algorithm takes T(n) time on a RAM, then it can be simulated on a BT_α machine (0 < α < 1) in O(T log T log log n) time by a nonuniform algorithm.

Proof: A straight-line RAM algorithm of length T is represented by a bounded-degree dag with T nodes. Every n-node bounded-degree dag is O(n)-separable. This implies that a dag with T nodes can be evaluated in the BT_α model (0 < α < 1) in time t(T), where t(T) fulfills the recursion t(T) = t(T_1) + t(T_2) + O(T log log n), where T_1 + T_2 = T and T/3 ≤ T_1, T_2 ≤ 2T/3. This recursion yields t(T) = O(T log T log log n).
•

The best known lower bound for DAG simulation seems to be Ω(T log n), which can be derived from Corollary 5.9.
Corollary 4.3: Two n × n matrices can be multiplied by a nonuniform algorithm in O(n^3) time on a BT_α machine (0 < α < 1) using (+, ×) only.

Proof: Hong and Kung [HK81] have shown that the dag representing the classical matrix multiplication algorithm (that uses + and × only) is O(n^{2/3})-separable. Consequently, we can use the above algorithm to show that two matrices can be multiplied in the BT_α model in time t(n^3), where t(n) fulfills the recursion t(n) = t(n_1) + t(n_2) + O(n^{2/3} log log n), where n_1 + n_2 = n and n/3 ≤ n_1, n_2 ≤ 2n/3. Since this recursion yields t(n) = O(n), it follows that two matrices can be multiplied in O(n^3) time, which is optimal within a constant factor.
•

The restriction of nonuniformity can be eliminated for matrix multiplication. Use a simple divide-and-conquer method: the product of two n × n matrices can be computed from 8 products of n/2 × n/2 matrices. Furthermore, a matrix with m elements can be broken into 4 submatrices by "unmerging" a list into four sublists in time O(m log log m). It follows that the running time of this algorithm fulfills the recursion t(n) = 8t(n/2) + O(n^2 log log n), so that t(n) = O(n^3). The algorithm derived in Corollary 4.3 is essentially similar.
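A minimal sketch of this divide-and-conquer multiply at the RAM level (the "unmerge" data movement and its O(m log log m) cost are not modeled; n is assumed to be a power of two):

```python
def mat_mul(A, B):
    # 8-way recursive block multiply: t(n) = 8 t(n/2) + O(n^2) work here;
    # on the BT_alpha machine the un/merge steps add the log log factor.
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M):  # split M into its four h x h quadrants
        return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
                [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A)
    B11, B12, B21, B22 = quad(B)
    C11 = add(mat_mul(A11, B11), mat_mul(A12, B21))
    C12 = add(mat_mul(A11, B12), mat_mul(A12, B22))
    C21 = add(mat_mul(A21, B11), mat_mul(A22, B21))
    C22 = add(mat_mul(A21, B12), mat_mul(A22, B22))
    return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
            [r1 + r2 for r1, r2 in zip(C21, C22)])
```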
4.3 Simulation of PRAM Algorithms by a BT_α Machine

Theorem 4.4: Let t, s, and p denote, respectively, the time, memory space, and number of processors required by a Concurrent Read Concurrent Write PRAM algorithm. Then, this algorithm can be simulated in time O(t × (s log log s + p × log p)) on a BT_α machine (0 < α < 1).
Proof: We use a simulation method due to Awerbuch, Israeli, and Shiloach [AIS83]. The first O(p) locations in memory are used to store processor records that represent the state of each simulated processor; the next O(s) memory locations contain a list of memory records representing the content of the PRAM memory. A PRAM step is simulated as follows.

• The next instruction of each processor is simulated; if this instruction accesses memory, then an access record containing the memory address, the processor id, and the stored value for a write, is created. The time for this phase is O(p log log p).

• The access records are sorted according to their address in time O(p log p) and write conflicts are eliminated.

• The list of access records is "merged" with the list of memory records in time O((p + s) log log(p + s)) = O(s log log s + p log p). The merge is actually an update operation that modifies each of the two lists: if an access record represents a write, then the corresponding memory record is updated; if it is a read, then the value of the corresponding memory record is copied to the access record; concurrent reads are also handled here.

• The access records are sorted by processor id in time O(p log p).

• The list of access records is "merged" with the list of processor records; the state of each processor that executed a read instruction is updated to contain the returned value.

The total time for the simulation of one PRAM step is O(s log log s + p log p).
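The phases above can be sketched at the RAM level as one function; the record format, the convention that reads see the memory contents from before the step, and the lowest-id write-conflict rule are our assumptions:

```python
from itertools import groupby

def simulate_crcw_step(processors, memory):
    # processors: list of ('read', addr, None) or ('write', addr, val)
    # instructions, one per processor; memory: dict of address -> value.
    # Build one access record per instruction, then sort by address
    # (phases 1-2 of the simulation above).
    records = sorted((addr, pid, op, val)
                     for pid, (op, addr, val) in enumerate(processors))
    results = [None] * len(processors)
    # Phase 3: "merge" the records with memory, grouped by address.
    for addr, grp in groupby(records, key=lambda r: r[0]):
        grp = list(grp)
        old = memory.get(addr)
        for _, pid, op, _ in grp:
            if op == 'read':
                results[pid] = old        # concurrent reads share one fetch
        writes = [r for r in grp if r[2] == 'write']
        if writes:
            memory[addr] = writes[0][3]   # lowest pid wins the conflict
    return results                         # phases 4-5: route values back
```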
•

Corollary 4.5: The connected components of an undirected graph G with n vertices and m edges can be obtained in O(m log^2 n) time on a BT_α machine (0 < α < 1).

Proof: Shiloach and Vishkin [SV82] have provided a parallel algorithm that takes O(log n) time and O(m) memory space using n + m processors and that computes the connected components of a graph with n vertices and m edges.
•

5. BOUNDS FOR OTHER MODELS

5.1 Bounds for the BT_1 Model

Time bounds for various problems in the BT_1 model (i.e., BT_α with α = 1) are listed in the table given below. The Θ(n log n) bounds for simple problems such as merging, the shuffle-exchange permutation, and the touch problem are proven using the same techniques as in Theorems 2.1 and 2.2; the O(n log^2 n) upper bounds for matrix transposition and computing FFT graphs can be obtained by using log n stages of the shuffle-exchange permutation, and that for sorting is obtained by using a simple merge sort algorithm. The dictionary algorithm is similar to that given in Theorem 4.1. However, unlike the upper bounds, the lower bounds for computing FFT graphs, matrix transposition, and sorting require a new technique, which is described below.

A computation is conservative if the only operations used are block moves. We have the following lower bound for conservative permutation algorithms.
Theorem 5.1: The average number of steps required to perform a randomly chosen permutation on n items using a conservative computation on a BT_1 machine is Ω(n log^2 n).
Proof Outline: It is useful to define models of a two-level memory system (denoted by L(m)) and a specially-blocked BT_1 machine, and prove Lemmas 5.2 and 5.3 for these models.
We define a two-level memory system, L(m), as one in which there are two memories: a primary memory that consists of m words (e.g., a cache) and a secondary memory that is potentially infinite in size. The processor is connected to the primary memory. Any set of b ≤ m/2 data items that are present in contiguous locations of the secondary memory can be transferred to any b locations in the primary memory in one transfer step. Conversely, the contents of any b locations in primary memory can be copied into contiguous locations in the secondary memory. In this model, it is assumed that the processor can perform any simple operation, like reading from (or writing into) primary memory, or comparing two words in the primary memory. Here, we are only interested in the number of transfer steps that are required for solving a given problem in this model, and since Aggarwal and Vitter [AV87] have considered this model in some detail, we quote the following result from their paper:
Lemma 5.2 [AV87]: The average number of transfer steps needed to perform a random permutation on n elements using a conservative computation on an L(m) machine is Ω(min(n, (n/m) × log(n/m))).

Proof: The proof of Lemma 5.2 can be found in [AV87].
•

A specially-blocked BT_1 machine is one in which the memory is partitioned into sets of contiguous locations such that the i-th set, S_i, contains the 2^i memory locations [2^i, 2^{i+1} − 1], and any algorithm for this machine transfers a block of contiguous data items only from set S_j to set S_{j−1}, or from S_{j−1} to S_j, for j ≥ 1.
Lemma 5.3: If a conservative algorithm runs on a BT_1 machine in time T, then this algorithm can be modified into a conservative algorithm for a specially-blocked BT_1 machine that runs in time O(T).

Proof: Partition the memory of a BT_1 machine M′ into sets of contiguous locations like the specially-blocked machine, i.e., the i-th set, S′_i, has size 2^i. The simulating specially-blocked machine M stores the content of S′_i in the lower part of S_{i+2}; the upper part is used as a buffer. Consider a block move [x − l, x] → [y − l, y] in M′ where x > y (the case x < y is treated symmetrically). The cost of this move is at least x. This move is simulated by M as follows. Let j = ⌊log(x − l)⌋. Then, the block is contained in S′_j ∪ S′_{j+1} and hence in S_{j+2} ∪ S_{j+3}. If part of it is in S_{j+3}, then move it into the buffer of S_{j+2} to obtain a contiguous block B in M to be moved. If part of B has to go to S_{j+2} (i.e., to S′_j), copy this piece directly. Copy the rest of B into the buffer of S_{j+1}. If part of this has to go to S_{j+1}, copy it, and move the rest to the buffer of S_j, and so on. The total time for all these moves is bounded by

Σ_{i=0}^{j+3} 2^{i+1} < 2^{j+5} = O(x).
•

Proof Outline (of Theorem 5.1, continued): Now, for 0 ≤ i ≤ log n − 2, consider a two-level memory system, L(m_i), such that m_i = 2^{i+1} − 1, and let T_i(m_i) denote the expected number of transfer steps that are required to achieve a permutation of n words on L(m_i). Using Lemmas 5.2 and 5.3, it follows that the expected time taken by any conservative algorithm on a BT_1 machine is at least

Ω(Σ_{i=0}^{log n − 2} 2^i × T_i(m_i)).

Consequently, the expected time taken to permute n words on a BT_1 machine is at least

c × Σ_{i=0}^{log n − 2} 2^i × min(n, (n/2^i) log(n/2^i)),

which is Ω(n log^2 n).
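The final sum can be checked numerically: each of the ~log n levels of the specially-blocked machine contributes transfer cost on the order of n log n, so the sum grows as n log^2 n. A small sketch (constants ours):

```python
import math

def conservative_lb(n):
    # Sum the L(m_i) transfer lower bound of Lemma 5.2, weighted by the
    # per-transfer cost 2^i, over levels i = 0, ..., log n - 2.
    total = 0.0
    for i in range(int(math.log2(n)) - 1):
        total += (2 ** i) * min(n, (n / 2 ** i) * math.log2(n / 2 ** i))
    return total

# The sum divided by n log^2 n stays within constant bounds:
for k in (10, 14, 18):
    n = 2 ** k
    assert 0.2 < conservative_lb(n) / (n * math.log2(n) ** 2) < 0.6
```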
•

Theorem 5.4: Any algorithm that sorts n words (using only comparisons and block moves), computes an FFT graph, or transposes a √n × √n matrix takes Ω(n log^2 n) time on a BT_1 machine.

Proof Outline: A sorting algorithm can be used to perform an arbitrary permutation by suitably fixing the outcome of the comparisons; three FFT graphs can be cascaded to obtain a permutation network. Thus, the lower bound for these two problems follows from Theorem 5.1. The lower bound for transposing a √n × √n matrix is obtained by suitably modifying Lemma 5.2, and the modified Lemma can be obtained from [AV87].
•5.2 BOIIIIIIs for BTa Model With €X > 1
For €X > 1, the time bounds for various problems in the BTamodel are lis~~d in the table given below. The time required for
simple problems such as merging, shuffle-exchange permutation,and the touch problem is dominated by the time needed to accessthe farthest element (8(n
lX». The upper bounds for matrix trans-
position, computing FFT graphs, and sorting can be obtained bysimple divide-and-conquer algorithms. The upper bound for matrixmultiplication can be obtained by using the algorithm followingCorollary 4.3. The only nontrivial bound is the lower bound formatrix multiplication (using semiring operations only) and this isgiven below:
Theorem 5.5: For α = 1.5, multiplying two n × n matrices using only (×, +) requires Ω(n^3 log n) time on a BT_α machine.
Proof Outline: Suppose two √n × √n matrices are stored in the secondary memory of a two-level memory L(m) machine. Then, it can be shown that any algorithm that uses only (×, +) takes Ω(n^{3/2}/m^{3/2}) transfer steps to multiply these matrices. Now, for 0 ≤ i ≤ log n − 1, consider a two-level memory system, L(m_i), such that m_i = 2^{i+1} − 1, and let T_i(m_i) denote the minimum number of transfer steps that are required to multiply two matrices on L(m_i). Using Lemmas 5.2 and 5.3 (which can be modified to apply to BT_α, for α > 1, as well), it follows that the time taken by any algorithm that multiplies two matrices in the BT_{1.5} model and that uses (×, +) only is at least Ω(Σ_{i=0}^{log n − 1} 2^{3i/2} × T_i(m_i)). Now, the result follows by noting that T_i(m_i) ≥ Ω(n^{3/2}/2^{3i/2}) for multiplying two matrices on L(m_i).
•

5.3 Optimal Bounds for the BT_log Model

The time bounds for various problems in the BT_log model are listed in the table given below. The upper bounds for simple problems such as merging, the shuffle-exchange permutation, and the touch problem are similar to those given in Theorem 2.2, except that the data are now partitioned into blocks of size log n instead of n^α. The lower bound proof for these problems is also similar to the proof of Theorem 2.1, and the modifications are explained below. The upper bound of O(n log* n) for matrix transposition is obtained by a divide-and-conquer algorithm, and the lower bound follows because matrix transposition requires all inputs to be separated from their predecessors. The upper bounds of O(n log n) for computing FFT graphs, computing arbitrary permutations, and for sorting follow from Theorems 3.3 and 3.4. T steps of a straight-line RAM algorithm can be simulated in time O(T log T log* n) by adapting the algorithm given in section 4.2. A dictionary can be maintained with search time O(log^3 n / log log n) and with amortized update time O(log n log* n) using a structure similar to that in section 4.1.

Another dictionary structure is obtained using a B-tree [AHU83] where buckets have size Θ(log n). Each bucket is organized as a balanced search tree stored in contiguous memory locations.
The one-to-one map from C to Γ is induced by a one-to-one function h that maps C_i into (γ_{3i−2}, γ_{3i−1}, γ_{3i}). We abbreviate h([x − 1, x] → [y − 1, y]) by h(l, x, y), and define h as follows:

(a) h(l, x, y) = (l, x, y) if l ≥ 2.
(b) h(0, x, y) = (x, y, y) if x, y ≥ 2.
(c) h(1, x, y) = (x, y, x).
(d) h(0, 1, y) = (y, y + 1, y).
(e) h(0, x, 1) = (x + 1, x, x + 1).

It can be seen that h is one-to-one since l < x, y; x ≠ y; and if l = 1 then x is neither y − 1 nor y nor y + 1. It can also be checked that for all i, 2t(C_i) ≥ cost(h(C_i)).

Now, to establish the claim that there are 3^{t−1} sequences Γ with cost(Γ) = t, let A_t = |{Γ | cost(Γ) = t}|, and observe that a sequence with cost t is obtained either from a sequence with cost t − 1 by appending a 2, or from a cost t − 2 sequence by appending a 3 or a 4, and so on. This yields a recurrence
Theorem 5.8: In a BT_log machine, there is a constant c such that the number of distinct sequences of operations (block copy or otherwise) with a total time ≤ t is at most 2^{ct}.
Proof: The proof follows as a direct extension of the above proof. The constant c depends on the number of different kinds of operations allowed in the BT_log machine.
One can also consider several extensions of the BT model. For example, one could study other access functions, such as step functions that may better reflect the physical situation of the memory hierarchy in real machines. Another possibility is that the transfer time for blocks could be changed from one per word to some function g(x) per word for copying from (or to) location x. Yet another aspect that is significant for real memory hierarchies is parallelism in block transfer (several reads from different disks may take place simultaneously; transfers at different levels may proceed concurrently, and can be overlapped with processing). Traditionally, this represented one of the early uses of parallelism in computers, and from a theoretical point of view, it would seem necessary if we are to have data structures with good worst-case performance. And, finally, in view of the importance of the memory organization in multiprocessor machines, some appropriate model for this would be nice. In any case, it is desirable that model extensions remain clean and robust.
This paper can be thought of as a step in developing a theory of computation that is aimed at data movement as against data modification. It is too early to tell if such a memory/communication oriented (versus CPU oriented) theory of computation will have any influence on pragmatic algorithms, machine architecture, memory management, or language design.
Some generalities are beginning to emerge. It appears that some problems (FFT, sorting, matrix multiplication) are very well behaved in that their running time (on BT or HMM using different access functions) usually equals the maximum of the RAM time and the time to read (i.e., touch) the inputs in the hierarchical memory (there seems to be a slight penalty when RAM time and touch time are roughly equal). This is about the best that good locality of reference could provide. Other problems (e.g., DAG simulation) appear not to behave in this manner. We do not understand what characterizes such behaviour.
6. Conclusions
In this paper we have introduced a model for hierarchical
memory with block transfer. It is relatively clean and robust, but
nevertheless appears to mimic the behavior of real machines. Good
algorithms in this model utilize both temporal and spatial locality
of reference. The theory for this model appears to be rich and
deep.
There are a large number of possibilities for future research; some are specific technical problems, others are more general issues.
Corollary 5.9: The expected time for achieving a random permutation on n elements on a BT_log machine is Ω(n log n).
Proof: A simple counting argument yields a lower bound of (1/c) · n log n · (1 − o(1)), where c is the constant in Theorem 5.8.
The Corollary applies even to non-uniform algorithms that perform block transfers or a finite set of other arbitrarily powerful operations, and in particular to conservative algorithms. It requires, however, that the permutation to be achieved is not given as an additional input. It may be noted that the lower bound results of Theorem 5.8 and Corollary 5.9 also apply to any BT_f machine where f(x) = Ω(log x) (provided f(x) = 0 for at most one value of x) and, in particular, they apply to BT_α for any α > 0.
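The counting in this proof is worth making explicit: a BT_log computation of length t realizes one of at most 2^{ct} distinct operation sequences (Theorem 5.8), while there are n! permutations in all, so most permutations need t ≥ log2(n!) / c = (1/c) n log n (1 − o(1)). A small numeric sketch of ours (the constant c is only known to exist, so it is left as a parameter):

```python
import math

def perm_lower_bound(n, c=1.0):
    """t must satisfy 2^{c t} >= n!, i.e. t >= log2(n!) / c."""
    return math.lgamma(n + 1) / (c * math.log(2))  # lgamma(n+1) = ln(n!)

# log2(n!) = n log2(n) (1 - o(1)): the ratio approaches 1 from below.
for n in (10**3, 10**4, 10**5):
    ratio = perm_lower_bound(n) / (n * math.log2(n))
    assert 0.8 < ratio < 1.0
```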
In the BT_α model, there is a sorting algorithm that takes O(n log n log log n) time and O(n) space, and another algorithm that takes O(n log n) time and O(n log log n) space. Is it possible to obtain an O(n log n) algorithm that takes only O(n) space? And, are there problems for which there are space-time tradeoffs?
Permuting data is at the heart of many BT algorithms and it would be nice to understand better the complexity of permutations. For example, even in the BT_log model, we showed that most permutations require Ω(n log n) time. Can one demonstrate such a permutation even for conservative algorithms?
There are numerous other areas, such as data structures and graph problems, that need to be analyzed. As an example, is it possible to maintain directories in BT_log with O(log^2 n / log log n) search time and O(log n log* n) amortized update time? Can the directory algorithms for BT_α be improved? And, it seems important to study good memory management algorithms.
Acknowledgements: The authors wish to thank Michael Fischer for several useful suggestions.
References
[AC86] R. C. Agarwal and J. W. Cooley, "Fourier Transform and Convolution Subroutines for the IBM 3090 Vector Facility," IBM J. of Research and Development, March 1986, 145-162.
[AV87] A. Aggarwal and J. Vitter, "The I/O Complexity of Sorting and Related Problems," Tech. Report, Dept. of Computer Science, Brown University, August 1987. Some results of this report can be found in Proc. of the 14th Int. Coll. on Automata, Languages and Programming, Karlsruhe, West Germany, July 1987, 467-478.
[AACS87] A. Aggarwal, B. Alpern, A. K. Chandra and M. Snir, "A Model for Hierarchical Memory," Proc. of the 19th Annual ACM Symposium on the Theory of Computing, New York, 1987, 305-314.
[AHU74] A. V. Aho, J. E. Hopcroft and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, 1974.
[AHU83] A. V. Aho, J. E. Hopcroft and J. D. Ullman, Data Structures and Algorithms, Addison-Wesley, Reading, Mass., 1983.
[AIS83] B. Awerbuch, A. Israeli and Y. Shiloach, "Efficient Simulation of PRAM by Ultracomputer," Technical Report 120, IBM Scientific Center, Haifa, Israel, May 1983.
[Ba80] J. L. Baer, Computer Systems Architecture, Computer Science Press, Potomac MD, 1980.
[BS80] J. L. Bentley and J. B. Saxe, "Decomposable Searching Problems. I. Static-to-Dynamic Transformations," J. of Algorithms, Dec. 1980, 301-358.
[De70] P. J. Denning, "Virtual Memory," ACM Computing Surveys, Sept. 1970, 153-189.
[F72] R. W. Floyd, "Permuting Information in Idealized Two-Level Storage," in R. E. Miller and J. W. Thatcher (editors), Complexity of Computer Computations, 105-109, Plenum Press, New York, 1972.
[FP79] P. C. Fischer and R. L. Probert, "Storage Reorganization Techniques for Matrix Computation in a Paging Environment," Comm. ACM, Vol. 22, No. 7, July 1979, 405-415.
[G74] J. Gecsei, "Determining Hit Ratios in Multilevel Hierarchies," IBM J. of Research and Development, July 1974, 316-327.
[HK81] J. W. Hong and H. T. Kung, "I/O Complexity: The Red-Blue Pebble Game," Proc. of the 13th Ann. ACM Symposium on Theory of Computing, Oct. 1981, 326-333.
[K73] D. E. Knuth, The Art of Computer Programming, Vol. 3, Sorting and Searching, §5.4.9, Addison-Wesley, Reading, Mass., 1973.
[L82] F. T. Leighton, "A Layout Strategy for VLSI Which Is Provably Good," Proc. of the 14th Ann. ACM Symposium on Theory of Computing, Oct. 1982, 85-98.
[MGS70] R. L. Mattson, J. Gecsei, D. R. Slutz and I. L. Traiger, "Evaluation Techniques for Storage Hierarchies," IBM Systems Journal, 1970, 78-117.
[MC69] A. C. McKellar and E. G. Coffman, Jr., "Organizing Matrices and Matrix Operations for Paged Memory Systems," Comm. ACM, Vol. 12, No. 3, March 1969, 153-165.
[MC80] C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, Reading, Mass., 1980, pg. 316.
[S84] A. Schönhage, "A Nonlinear Lower Bound for Random Access Machines Under Logarithmic Cost," Technical Report RJ 4527, IBM Almaden Research Laboratory, May 1984.
[Si83] G. M. Silberman, "Delayed-Staging Hierarchy Optimization," IEEE Trans. on Computers, TC-32, Nov. 1983, 1029-1037.
[SV82] Y. Shiloach and U. Vishkin, "An O(log n) Parallel Connectivity Algorithm," J. of Algorithms, 1982, 57-67.
[Sm86] A. J. Smith, "Bibliography and Readings on CPU Cache Memories and Related Topics," Comp. Arch. News, Jan. 1986, 22-42.
[T83] R. E. Tarjan, Data Structures and Network Algorithms, SIAM,Philadelphia Penn., 1983.
[W83] C. K. Wong, Algorithmic Studies in Mass Storage Systems,Computer Science Press, Rockville MD., 1983.