Efficient Compilation for Queue Size
Constrained Queue Processors
Arquimedes Canedo a,∗, Ben A. Abderazek b, Masahiro Sowa c

a IBM, Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi,
Kanagawa-ken 242-8502, Japan

b University of Aizu, Aizu-Wakamatsu, Fukushima-ken 965-8580, Japan

c University of Electro-Communications, Graduate School of Information
Systems, Chofugaoka 1-5-1, Chofu-Shi 182-8585, Japan
Preprint submitted to Elsevier 30 September 2008

Abstract

Queue computers use a FIFO data structure for data processing. The essential
characteristics of a queue-based architecture excel at satisfying the demands of
embedded systems: a compact instruction set, simple hardware logic, high
parallelism, and low power consumption. The size of the queue is an important
concern in the design of a realizable embedded queue processor. We introduce the
relationship between parallelism, the length of data dependency edges in data flow
graphs, and the queue utilization requirements. This paper presents a technique
that makes the compiler aware of the size of the queue register file and thus
optimizes programs to effectively utilize the available hardware. The compiler
examines the data flow graph of a program and partitions it into clusters
whenever it exceeds the queue limits of the target architecture. The presented
algorithm deals with the two factors that affect the utilization of the queue:
parallelism and the length of variables' reaching definitions. We analyze how the
quality of the generated code is affected for SPEC CINT95 benchmark programs and
different queue size configurations. Our results show that, for reasonable queue
sizes, the compiler generates code that is comparable to the code generated for
infinite resources in terms of instruction count, static execution time, and
instruction level parallelism.

Key words: Queue Register File, Queue Processor, Constrained, Optimization,
Compiler
∗ Corresponding Author.
Email address: canedo@sowa.is.uec.ac.jp (Arquimedes Canedo).
URL: http://www.sowa.is.uec.ac.jp (Arquimedes Canedo).

1 Introduction

Queue-based computers are a novel and viable alternative for high-performance
embedded systems and general purpose processors. Several queue computation
models have already been proposed [1,2,43,38,42,33,46,35]. Queue computers
use high speed registers organized as a first-in first-out (FIFO) queue to
perform operations. Data is written, or enqueued, at the tail of the queue,
and elements are read, or dequeued, at the head of the queue. Two hardware
pointers are kept by the processor to track the positions of the head and
tail. These pointers are referred to as QH and QT. Conventional compilation
techniques for random access register machines cannot be utilized
to generate queue programs, since the FIFO queue demands strict ordering
of operands and operations [10]. A level-order traversal of the data flow
graph gives the order in which data should be enqueued, dequeued, and
processed [35]. The queue computation model features unique characteristics
that make it very attractive for addressing the problems of current computer
design: simple hardware structures, low power consumption, and exploitation
of instruction level parallelism. The instructions of a queue processor consist
only of an opcode, as operand reads and writes are done implicitly through the
queue head and queue tail pointers. Queue machines are very similar to
stack machines [24,31], with the notable exception that the queue computation
model does not suffer from the performance bottleneck created at the top
of the stack [44]. Queue machines facilitate parallel execution of programs,
as one end of the queue is used exclusively for reading and the other end
exclusively for writing. Furthermore, queue programs are generated from a
level-order scheduling that exposes all the parallelism available in the data
flow graph. Having groups of data independent instructions permits the
hardware to have a smaller instruction window than conventional RISC
machines and potentially consume less power [6]. Another advantage of queue
machines over conventional architectures is the lack of false dependencies
introduced in programs by the tight coupling between the instruction set and
the architected registers. Hence, queue processors do not need power hungry
structures such as register renaming [43,2,1]. Furthermore, the implicit
operand access allows instructions to have a small encoding, simplifying
fetching and decoding, reducing memory bandwidth, and reducing power
consumption [7,40,15,16,23].
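The enqueue/dequeue discipline described above can be pictured with a standard FIFO container; a minimal Python sketch (illustrative only, not part of any queue processor tool-chain):

```python
from collections import deque

q = deque()
q.append('a')               # enqueue at the tail (QT)
q.append('b')
q.append('c')
assert q.popleft() == 'a'   # dequeue from the head (QH): FIFO order
assert q.popleft() == 'b'
```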
At any point in the execution of a program, the queue length is the number
of elements between QH and QT. Every statement in a program may have
different queue length requirements, and the hardware should provide enough
words in the FIFO queue to hold the values and evaluate the expression. We
developed the Queue Compiler Infrastructure [10] as part of the design space
exploration tool-chain for the QueueCore processor [1]. The original queue
compiler targets an abstract queue machine with unlimited resources,
including an infinite queue register file. Based on this assumption, we
measured the queue length requirements of the SPEC CINT95 applications.
Figure 1 shows that 95% of the statements in the programs require less than
32 queue words for their evaluation, and the remaining 5% demand a queue
size between 32 and 363 words. In our previous work [9] we gained insight
into how queue length is mainly affected by two program characteristics:
parallelism and soft edges. Soft edges represent the lifetime, in queue words,
of the reaching definitions of variables. Graphically, a soft edge is an edge
that spans more than one level in the data flow graph. Table 1 shows the
maximum queue requirements for the peak parallelism and maximum def-use
length in SPEC CINT95 programs compiled for an infinite queue. This table
demonstrates that a reasonable and realizable amount of queue is needed in
queue processors to execute the programs without performance penalty.
However, assistance from the compiler is required to schedule programs in
such a way that parallelism and soft edges comply with the queue register
file size of a realistic queue processor. This paper presents an optimizing
compiler that partitions the data flow graphs of programs into clusters of
constant parallelism and limited soft-edge length that can be executed on a
queue processor with a limited queue register file. The compiler is also
responsible for generating clusters that obey
[Figure 1 plot: cumulative percentage of statements (y-axis, 0-100%) versus
queue length in words (x-axis, 1-31), with one curve per benchmark: 099.go,
124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex.]
Fig. 1. Queue size requirements. The graph quantifies the amount of queue
required to execute statements in the SPEC CINT95 benchmarks. A point (x, y)
denotes that y% of the statements in the program require x or fewer queue words
to evaluate the expression.
the semantics of the queue computation model. The proposed algorithm was
implemented in the queue compiler infrastructure [10,8] and affects compile
time by a negligible amount. The goal of this paper is to estimate how the
characteristics of the output code are affected when the available queue is
constrained. We estimate how the critical path, available parallelism, and
program length of the SPEC CINT95 benchmarks are affected for different
size configurations of the queue register file. The contributions of this work
are:

• This is the first study, to the best of our knowledge, that estimates the
performance of a queue processor with a limited number of queue words.
• The development of an efficient compiler algorithm that partitions the data
flow graph into clusters that demand no more queue than what is available
in the underlying architecture. This is achieved by limiting the parallelism
and the length of reaching definitions in the data flow graph.

Section 2 gives a summary of the related work. Section 3 introduces the
queue computation model, a producer order queue processor, and provides
Table 1
Characteristics of programs that affect the queue length in queue-based computers
Benchmark Peak Parallelism Max. def-use
099.go 20 19
124.m88ksim 29 19
126.gcc 35 56
129.compress 9 10
130.li 17 18
132.ijpeg 26 24
134.perl 15 15
147.vortex 49 14
the description of the queue compiler infrastructure. Section 4 presents the
algorithm used to partition the data flow graph of programs into clusters
of fixed queue utilization. An analysis of the experimental results is given in
Section 5. Section 6 concludes this paper.
2 Related Work

The benefits and simplicity of 0-operand machines have been considered
since the late 1950s. Several computers implement a LIFO stack in hardware
to perform computations [5,17,24]. Several computer languages, compilers,
interpreters, and virtual machines have been inspired by this computation
model [36,29,27,22]. Practice has shown that the performance of the stack
model is limited due to the bottleneck created at the top of the stack, which
is the only place to read and write operands [44,39]. In contrast to stack
computing, the queue computing model offers a parallel model with the same
instruction format and characteristics. [9] showed that non-optimized queue
code has about 13% more parallelism than optimized register code for an
8-way issue universal register machine. Surprisingly, only a handful of queue
processors have been proposed in the literature.
In [35], Preiss et al. proposed the first queue computer design together with
the theory behind compiling expressions for the queue computation model.
They demonstrate that a level-order scheduling of an expression's parse tree
generates the sequence of instructions for correct evaluation. A level-order
scheduling of directed acyclic graphs (DAGs) still delivers the correct
sequence of instructions but requires additional hardware support. This
hardware modification is called an Indexed Queue Machine. The basic idea is
to specify, for each instruction, the location with respect to the head of the
queue (an index) where the result of the instruction will be used. An
instruction may include several indexes if it has multiple parent nodes. The
instruction format of the Indexed Queue Machine resembles a data flow
computer [11]. All these ideas concerned an abstract machine until the
hardware mechanism of a superscalar queue machine was proposed by
Okamoto [33]. This superscalar queue machine realized the abstract design
as a hardware implementation capable of executing instructions in parallel.
The problem of generating code for a single queue has been demonstrated to
be NP-Complete [18,19]. In [38], Schmit et al. proposed a heuristic algorithm
to cover any DAG in one queue by adding special instructions to the data
flow graph. Their experimental results report a large number of additional
instructions, making this technique insufficient for achieving small code size.
Despite the large number of extra instructions, the resulting size of their
tested programs is smaller than that of a RISC design.
The concept of an operand queue has been used as supporting hardware for
the efficient execution of loops in two register-based processors. The WM
architecture [46] is a register machine that reserves one of its registers for
accessing the queue and demonstrates high streaming processing capabilities.
The compiler for the WM machine [4] was developed to support streaming
as an extension of the access/execute computation model. In [42,14], the use
of Register Queues (RQs) effectively reduces the register pressure on software
pipelined loops. The compiler techniques developed for these processors rely
on conventional register transfer intermediate languages, treating the queues
as a special purpose set of registers.
In our previous work [43,2,1], we investigated and designed a producer order
parallel queue processor (QueueCore), which is capable of executing any
data flow graph. Our model breaks the rule of dequeueing by allowing
operands to be read from a location other than the head of the queue. This
location is specified as an offset in the instruction. The fundamental
difference from the Indexed Queue Machine is that our design specifies an
offset reference in the instruction for reading operands instead of specifying
an index for writing operands. In the QueueCore's instruction set, the
writing location at the rear of the queue remains fixed for all instructions.
To realize QueueCore as an actual processor, we must explore how the size
of the queue register file affects performance. None of the previous work
related to queue computers has considered a constrained queue register file.
We know from more than fifty years of experience with register machines
that the size of the register file directly affects the overall performance of a
computer system. Many works have proposed optimizations of the register
file for improved execution time [47,28,13], parallelism [45,34,30], power
consumption [3,37,25,41], hardware complexity [21,20], etc. [26,48].
3 Overview of Queue Computing

The Queue Computation Model (QCM) refers to the evaluation of expressions
using a first-in first-out queue. Read and write operations are done through
the head (QH) and tail (QT) of the queue. For every instruction, the hardware
calculates the correct location of its operands, thus making correct execution
possible. Queue length is the number of elements between QH and QT at a
given state of execution. Queue length is tightly related to the way queue
programs are generated. To evaluate any expression, the directed acyclic
graph (DAG) of the expression should be scheduled in a level-order
manner [35]. A level-order scheduling visits all nodes in a DAG from the
deepest level towards the root, as shown in Figure 2(a). All nodes belonging
to the same level Li are data independent and can be executed in parallel.
The queue length at any point in a program's execution can be measured by
the number of elements in level Li. Figure 2(c) shows the positions of QH and
QT for each level in the expression. The queue length requirements are also
shown in the figure: for L3, the queue length requirement is three; for L2,
two; for L1, one; and for L0, zero. Arithmetic instructions encode two
operands that specify the locations, with respect to QH, from which the
operands should be read. This architectural modification is necessary to
allow the execution of any data flow graph; the details are discussed
elsewhere [43,2]. This kind of queue processor is called a producer order
queue processor.
[Figure 2 layout. Panel (a): level-order traversal of the parse tree for
x = (a+b)/(b-c), with levels L3 (a, b, c), L2 (+, -), L1 (/), and L0 (st x).
Panel (b): the producer order queue program:

ld a
ld b
ld c
add 0, 1
sub -1, 0
div 0, 1
st x

Panel (c): queue contents between QH and QT at each execution level, with
queue length requirements 3 (L3), 2 (L2), and 1 (L1).]
Fig. 2. Queue program characteristics. (a) Level-order traversal of a DAG. (b)
Producer order queue instructions. (c) Queue contents and queue length at each
execution level.
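To make the program of Figure 2(b) concrete, the sketch below simulates a producer order queue executing it. The QH-update rule used here (advance QH one slot past the largest nonnegative operand offset) is our inference from the queue snapshots in panel (c), not a statement of QueueCore's actual pointer semantics:

```python
def run(program, memory):
    """Simulate a producer order queue program (a sketch; the QH-advance
    rule is inferred from Figure 2(c), not from the QueueCore spec)."""
    queue, qh = [], 0
    ops = {'add': lambda a, b: a + b,
           'sub': lambda a, b: a - b,
           'div': lambda a, b: a / b}
    for ins in program:
        name, args = ins[0], ins[1:]
        if name == 'ld':                      # enqueue a value at QT
            queue.append(memory[args[0]])
        elif name == 'st':                    # store the word at QH
            memory[args[0]] = queue[qh]
            qh += 1
        else:                                 # binary op: operands at QH+offset
            m, n = args
            queue.append(ops[name](queue[qh + m], queue[qh + n]))
            qh += max(o for o in (m, n) if o >= 0) + 1

program = [('ld', 'a'), ('ld', 'b'), ('ld', 'c'),
           ('add', 0, 1), ('sub', -1, 0), ('div', 0, 1), ('st', 'x')]
mem = {'a': 2.0, 'b': 3.0, 'c': 1.0}
run(program, mem)
# mem['x'] == (2+3)/(3-1) == 2.5
```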
3.1 Target Architecture: QueueCore processor

The QueueCore processor [1] implements a producer order instruction set
architecture. Each instruction can encode a maximum of two operands that
specify the locations in the queue register file from which to read. For each
instruction, the processor determines the physical location of the operands
by adding the offset reference in the instruction to the current position of
the QH pointer. A special unit called the Queue Computation Unit is in
charge of finding the physical location of source operands and their
destination within the queue register file, thus allowing parallel execution of
instructions. Every instruction of the QueueCore is 16 bits wide. For cases
where there are insufficient bits to express large constants, memory offsets,
or offset references, a covop instruction is inserted. This special instruction
extends the operand field of the following instruction by concatenating it to
its own operand. The size of the queue register file of the QueueCore
processor is set to 256 words.
3.2 Compiling for QueueCore

In our previous work [8], we developed the compiler infrastructure for the
QueueCore processor [1]. The compiler translates C programs into
QueueCore assembly. An important task of the queue compiler is to schedule
the program in a level-order manner and to correctly compute the offset
references of instructions [10]. Figure 3 shows the block diagram of the
queue compiler, including the proposed clusterization pass. The front-end of
the compiler is based on GCC 4.0.2; it parses C files into abstract syntax
trees (ASTs) using a high level intermediate representation called
GIMPLE [32]. The GIMPLE representation is a three-address code suitable
for code generation for register machines, but not for queue machines. As
the level-order scheduling traverses the full DAG of an expression, we
reconstruct GIMPLE trees into trees of arbitrary depth and width called
QTrees. During the expansion of GIMPLE into QTrees, we also translate
high-level constructs such as aggregate types into their low level
representation in QueueCore instructions. After having fully expanded the
trees into QTrees, we build the core data structure, the leveled DAG
(LDAG) [18]. LDAGs allow the code generation algorithm of the queue
compiler to determine the offset references for every instruction. The offset
calculation phase consists of determining, for every instruction, the current
location of QH and measuring the distance to the instruction's operands.
The measured distance is the offset reference value, and it is annotated in
the LDAG. Then the LDAGs are level-order scheduled and a linear low-level
intermediate representation (QIR) is emitted. The last phase of the compiler
is the generation of assembly code from the QIR representation.

Under the queue computation model, the queue utilization requirements are
given by the number of nodes at every level of computation. In the following
section, we propose an algorithm that partitions expressions into clusters to
reduce queue length in such a way that the rules of queue computing are
preserved. We chose LDAGs as the input of the algorithm, since at this
point all dependency edges between operations and operands are determined.
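As an illustration of the leveling step, the sketch below assigns each node of an expression DAG to a level (root at L0, operands below their consumers) and returns the level-order schedule, deepest level first. The function name and graph encoding are our own; the queue compiler works on its internal LDAG structures:

```python
from collections import defaultdict, deque

def build_ldag_levels(dag, root):
    # dag maps each node to the list of its operand nodes.
    level = {root: 0}
    work = deque([root])
    while work:
        n = work.popleft()
        for child in dag.get(n, []):
            depth = level[n] + 1          # operands sit below their consumer
            if level.get(child, -1) < depth:
                level[child] = depth      # keep the deepest level for shared nodes
                work.append(child)
    levels = defaultdict(list)
    for n, d in level.items():
        levels[d].append(n)
    # Level-order schedule: deepest level first, root (L0) last.
    return [levels[d] for d in sorted(levels, reverse=True)]

# The DAG of Figure 2: x = (a+b)/(b-c); node 'b' is shared by '+' and '-'.
schedule = build_ldag_levels(
    {'st': ['/'], '/': ['+', '-'], '+': ['a', 'b'], '-': ['b', 'c']}, 'st')
# → [['a', 'b', 'c'], ['+', '-'], ['/'], ['st']]
```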
[Figure 3 block diagram of the Queue Compiler Infrastructure. Phases:
Front-End, 1-offset Code Generation, Offset Calculation, Instruction
Scheduling, Assembly Generation, and the proposed Clusterize pass.
Intermediate forms: C source file, ASTs, QTrees, Leveled DAGs, QIR,
QueueCore Assembly.]
Fig. 3. Queue Compiler Infrastructure. Block diagram showing the phases and
intermediate representations used along the compilation.
4 Algorithm for Queue Register File Constrained Compilation

Queue length refers to the number of elements stored between QH and QT at
some computation point. We have introduced how queue length can be
determined by counting the number of elements in a computation level.
Nevertheless, DAGs often present cases where this assumption is not enough
to estimate the queue length requirements of an expression. Consider the
DAG shown in Figure 4 for the multiply-accumulate operation commonly
used in signal processing, "y[n] = y[n] + x[i] * b[i]". Notice that some edges
span more than one level (soft edges). For example, the edge with its source
at node "sst" and its sink at node "+" of L3 spans three levels. Soft edges
increase the queue length requirements, since the sink node must be kept in
the queue until the source node is executed. For the given example, the
maximum queue length requirement of the DAG is five queue words and is
imposed by the longest soft edge. The algorithm must deal with two
different conditions that directly affect the queue length requirements of an
expression: the length of computation levels, and the length of soft edges.
The first can be solved by splitting the level into manageable sized blocks.
The second can be solved by re-scheduling the child's subtree. The order in
which these actions are performed affects the quality of the output DAGs
and, therefore, the quality of the generated code.

If the levels are first split and then the subtrees are re-scheduled, the second
action affects the length of the levels in the final DAG. The first
transformation must then be performed one more time to guarantee that all
levels comply with the target queue length. If the order of the actions is
inverted, the DAG will be expanded into a tree and all subexpressions will
have to be recomputed, since all subtrees are completely expanded, thus
affecting performance and code size due to the redundant instructions. We
propose an algorithm that deals with these problems in a unified manner.
Our integrated solution reduces subexpression re-scheduling and minimizes
the insertion of spill code.
[Figure 4 layout: the DAG of "y[n] = y[n] + x[i] * b[i]" over levels L0-L6,
with its producer order queue program:

ld i
ldi size
ld n
lea x
mul 0, 1
lea b
lea y
mul 0, -1
add 0, 1
add 0, -1
add 0, 1
lds 0
lds 0
lds 0
mul 0, 1
add 0, 1
sst -5, 0]
Fig. 4. Queue length is determined by the width of levels and length of soft edges.
4.1 Data Flow Graph Clusterization

The main task of the clusterization algorithm is to reduce a DAG's queue
length requirements by splitting it into clusters of a specified size. The
algorithm must partition the DAG in such a way that every cluster is
semantically correct in terms of the queue computing model. Partitioning
involves the addition of extra code to communicate the intermediate values
computed in different clusters. Our algorithm uses memory to communicate
intermediate values between clusters. The input of the algorithm is an
LDAG data structure. For the queue compiler, a cluster is defined as an
LDAG with spill code that communicates intermediate values to other
clusters through memory. Keeping clusters as LDAGs allows the
implementation to use the same infrastructure, and the later phases of the
queue compiler remain free of modification.
The algorithm is divided into two phases: labeling and spill insertion. The
labeling phase groups the DAG subtrees into clusters in order to preserve
the rules of queue computing. For any given DAG or subtree W rooted at
node R, the width of W is verified to be smaller than the threshold. The
threshold is the size of the queue register file for which the compiler should
generate constrained code. If the condition is true, then all nodes in W are
labeled with a unique identifier called the cluster ID. In cases where the
width of the DAG exceeds the threshold, the DAG must be recursively
partitioned in a post-order manner, i.e. starting from the left child and then
the right child of R. The labeling algorithm is listed in Algorithm 1. To
measure the width of a subtree W, the DAG is traversed as a tree and the
level with the most elements is taken as the width of W. Notice that when
the DAG rooted at "sst" in Figure 4 is traversed as a tree, the maximum
width is encountered in level L5 with six elements, corresponding to nodes
n, size, x, ∗, b, ∗. For simplicity of the explanation, assume that the
threshold equals 2. Since SubTree_Width(sst) > Threshold, the partitioning
algorithm recurses on the left hand side node "+" at L3. The width of the
"+" subtree is 2, equal to the threshold. Thus, all nodes belonging to the
subtree rooted at the "+" node at L3 are marked with cluster ID = 1 by
line 14 of Algorithm 1. The algorithm continues with the rest of the DAG
until all nodes have been traversed and assigned to a cluster. The output of
the labeling phase is a labeled DAG as shown in Figure 5. Four clusters are
shown in the figure: the first cluster has its root node at (+) with its node
in L3, the second cluster is rooted at the (lds) with its node in L2, the third
cluster is at the (lds) with its node in L3, and the fourth cluster is at (sst)
with its node in L0.
The spill insertion phase is the second and final phase of the algorithm. The
[Figure 5 layout: the DAG of Figure 4 (levels L0-L6) with every node
annotated with the cluster ID assigned by the labeling phase.]
Fig. 5. Output of the labeling phase of the clusterization algorithm
annotated LDAG from the previous phase is processed, and a list of N
clusters (a cluster set) is generated as the output. The input annotated
DAG is traversed in a post-order manner. For every node visited in the
traversal, a set of actions is performed to: (1) assign the node to the
corresponding cluster, (2) insert reload operations to retrieve temporaries
computed in a different cluster, and (3) insert operations to spill
temporaries used by different clusters.

Assigning nodes to the corresponding cluster involves the creation of LDAG
data structures, node information, and data dependency edges. Using the
queue compiler's LDAG infrastructure [8] allows the clusterization algorithm
to be implemented in a clean and simple manner. In terms of memory
complexity, only a list of length N is required in the compiler to generate
the clusters. The value of N is the number of clusters discovered by the
labeling phase.
Algorithm 1 labelize (LDAG W)
Require: Threshold
 1: root ← W's root node
 2: if SubTree_Width (root) > Threshold then
 3:   lhs ← labelize (root.lhs)
 4:   if SubTree_Width (root.rhs) > Threshold then
 5:     rhs ← labelize (root.rhs)
 6:     root ← Assign_ID_to_node (rhs.id)
 7:     return root
 8:   else
 9:     root.rhs ← Assign_ID_to_subtree (root.rhs)
10:     root ← Assign_ID_to_node (root.rhs)
11:     return root
12:   end if
13: else
14:   root ← Assign_ID_to_subtree (root)
15:   return root
16: end if

Spill code is inserted in two situations: to deal with intermediate results used
by different clusters, and to solve the problem of soft edges that span more
than one level and demand more queue than was specified by the threshold
(in the same or different clusters). Only subexpressions are spilled to
memory and reloaded. Variables and constants that are used by multiple
nodes are only reloaded, since a spill/reload pair would require extra
instructions and extra memory space for temporaries. For every node, the
algorithm detects whether an operation u needs an operand v to be
reloaded: this is the case whenever the cluster identifiers of the node and
the operand differ, ID(u) ≠ ID(v), or a soft edge longer than the threshold
with its source at node u exists. After reload detection, the node u is
analyzed for spilling as follows: if the analyzed node u is a subexpression
and has more than one parent node, then a spill operation is inserted.
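A runnable paraphrase of the labeling phase in Python may help; it operates on a plain binary expression tree rather than the compiler's LDAGs, and the helper names are ours:

```python
from itertools import count

class Node:
    def __init__(self, op, lhs=None, rhs=None):
        self.op, self.lhs, self.rhs = op, lhs, rhs
        self.cluster_id = None

def subtree_width(root):
    # Width of a (sub)tree: traverse it and take the level with the most nodes.
    counts = {}
    def walk(n, d):
        if n is None:
            return
        counts[d] = counts.get(d, 0) + 1
        walk(n.lhs, d + 1)
        walk(n.rhs, d + 1)
    walk(root, 0)
    return max(counts.values())

def assign_subtree(n, cid):
    if n is not None:
        n.cluster_id = cid
        assign_subtree(n.lhs, cid)
        assign_subtree(n.rhs, cid)

def labelize(root, threshold, ids=None):
    # Group subtrees into clusters whose width fits the threshold
    # (the available queue size), following Algorithm 1.
    ids = ids or count(1)
    if subtree_width(root) <= threshold:
        assign_subtree(root, next(ids))          # line 14 of Algorithm 1
        return root
    if root.lhs is not None:
        labelize(root.lhs, threshold, ids)       # partition the left child first
    if root.rhs is not None and subtree_width(root.rhs) > threshold:
        labelize(root.rhs, threshold, ids)
        root.cluster_id = root.rhs.cluster_id    # root joins the rhs cluster
    else:
        cid = next(ids)
        if root.rhs is not None:
            assign_subtree(root.rhs, cid)
        root.cluster_id = cid
    return root

# (a*b) + (c*d) has width 4; a threshold of 2 yields two clusters.
t = Node('+', Node('*', Node('a'), Node('b')), Node('*', Node('c'), Node('d')))
labelize(t, 2)
# → lhs subtree is cluster 1; rhs subtree and the root are cluster 2
```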
Figure 6 shows the clusters generated for the example in Figure 5. Four
clusters are generated after the spill code is inserted. The gray nodes in the
figure represent the nodes that are spilled to memory. Notice that the node
(size) has two parents in Figure 5, but no temporary is generated, as it is
not a subexpression but a constant known at compilation time. The
rectangle nodes represent the reload operations needed to retrieve the
computed subexpressions from other clusters, variables, and constants. All
four clusters in the figure comply with the requirement of not exceeding a
queue utilization of two. For this example, the penalty of compiling for a
queue register file size of two words is the insertion of ten extra instructions:
four spills and six reloads.
[Figure 6 layout: the four clusters (Cluster 1 to Cluster 4) obtained from
Figure 5 after spill insertion, communicating through temporaries
tmp1-tmp4.]
Fig. 6. Output of the clusterization algorithm. Spill nodes are marked with gray
circles and reload operations are represented by rectangles.
Algorithm 2 lists the actions performed to generate a set of clusters over
the annotated LDAG. As clusters have the same shape as LDAGs, we can
use the queue compiler infrastructure to generate code directly from the
cluster set. Each cluster is treated as an LDAG, and the code generator [8]
calculates the offset references for all instructions, including spill code.
Additionally, with the described clusterization algorithm, the queue compiler
internals remain untouched and the compilation flow remains the same as in
the original compiler. Clusters are connected to each other by data
dependency edges. The order in which the clusters are scheduled is very
important to preserve the correctness of the program. We built a cluster
dependence graph (CDG) to facilitate code generation. The CDG for the
example given above is shown in Figure 7. At first, cluster 1 must be
scheduled for execution, followed by clusters 2 and 3, and finally cluster 4.
Some clusters are independent from each other, like clusters 2 and 3, and
can be scheduled in any order. In this paper, we schedule the clusters in the
same order as they are discovered by the labeling algorithm. However, we
note that this may present an opportunity for further optimization.
[Figure 7 layout: cluster nodes 1, 2, 3, and 4, with cluster 1 preceding
clusters 2 and 3, which both precede cluster 4.]
Fig. 7. Cluster Dependence Graph (CDG)
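Scheduling the clusters in any order compatible with the CDG is a topological sort. A sketch using Python's standard `graphlib` (3.9+), with the dependencies read off the text's description of Figure 7:

```python
from graphlib import TopologicalSorter

# CDG of Figure 7: each key maps to the set of clusters it depends on
# (cluster 1 feeds clusters 2 and 3, which both feed cluster 4).
cdg = {2: {1}, 3: {1}, 4: {2, 3}}
schedule = list(TopologicalSorter(cdg).static_order())
# schedule starts with 1 and ends with 4; clusters 2 and 3 may come in either order
```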
5 Results

The primary concern of this study is to analyze how the quality of the
generated programs is affected when the program is constrained to a limited
number of queue words. We concentrate on three aspects of the output
programs: (1) instruction count, (2) critical path, and (3) instruction level
parallelism. Instruction count is the number of generated instructions,
including spill code and reloads. The critical path refers to the height of the
program's data flow graph, given by the number of queue computation
levels. This metric provides a compile-time estimation of the execution time
of the program on a parallel queue processor. The instruction level
parallelism of a queue program is estimated as the average number of
instructions per computation level of its data flow graph.

Algorithm 2 clusterize (node u, LDAG W)
Require: An empty cluster set C of N elements
 1: /* Traverse as a DAG */
 2: if AlreadyVisited (u) then
 3:   return NIL
 4: end if
 5: /* Action 1: add to corresponding cluster */
 6: ClusterSet_Add (C, ID(u), u)
 7: /* Action 2: generate reloads */
 8: for all children v of u do
 9:   if ID(v) ≠ ID(u) then
10:     GenReload (C, ID(u), v)
11:   else if isSoftEdge (u, v) AND EdgeLength (u, v) > Threshold then
12:     GenReload (C, ID(u), v)
13:   else
14:     /* Post-Order traversal */
15:     clusterize (v, W)
16:   end if
17: end for
18: /* Action 3: generate spills */
19: if Parents (u) > 1 AND isSubexpression (u) then
20:   GenSpill (C, ID(u), u)
21: end if
22: /* Mark visited and return */
23: MarkVisited (u)
24: return u
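The three metrics can be computed directly from a level-order program; a small sketch under the representation assumption that a program is a list of levels, each holding that level's instructions:

```python
def code_metrics(levels):
    # levels: a level-order program as a list of levels, each a list of
    # instructions (an assumed representation for this sketch).
    instruction_count = sum(len(level) for level in levels)
    critical_path = len(levels)                  # number of computation levels
    ilp = instruction_count / critical_path      # average instructions per level
    return instruction_count, critical_path, ilp

# The program of Figure 2: three loads, two ops, the division, and the store.
levels = [['ld a', 'ld b', 'ld c'], ['add', 'sub'], ['div'], ['st x']]
metrics = code_metrics(levels)
# → (7, 4, 1.75)
```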
The methodology used to perform the experiments is as follows. We
successfully implemented the presented algorithm in the queue compiler
infrastructure [8]. The threshold input value for the algorithm is given as a
compiler option. For all experiments, the compiler was configured with only
the presented clusterization algorithm; no other optimizations are currently
available in the queue compiler. We compiled the SPEC CINT95 benchmark
programs [12] with threshold values of 2, 4, 8, 16, 32, and infinity.
Table 2 quantifies the compilation-time cost of the presented algorithm. The second column (LOC) shows the lines of C code of the input programs. The third column (Baseline) shows the compilation time with constrained compilation disabled. The rightmost column (Constrained) shows the compilation time taken by the queue compiler with constrained compilation enabled and the threshold set to 2. This threshold value is the worst-case configuration for the algorithm, as the available queue is only two words. The table demonstrates that the penalty of this optimization has a negligible effect on the compilation time of the queue compiler. Compilation time is the real time measured on a dual 3.2 GHz Xeon computer running GNU/Linux 2.6.20. The compiler was bootstrapped with debugging facilities, and no optimizations were added.
Table 2
Estimation of constrained compilation complexity, measured as compile time, for the SPEC CINT95 benchmark programs with the threshold set to two.
Benchmark LOC Baseline Constrained
099.go 28547 9.34s 9.35s
124.m88ksim 17939 9.58s 9.69s
126.gcc 193752 42.67s 43.39s
129.compress 1420 0.37s 0.38s
130.li 6916 3.20s 3.27s
132.ijpeg 27852 8.88s 9.10s
134.perl 23678 6.92s 7.26s
147.vortex 52633 18.73s 19.02s
5.1 Qualitative Analysis of the Output Code

5.1.1 Instruction Count
The most evident effect of the clusterization algorithm on the output code is the instruction count. Spill code is inserted whenever the width of a level or a soft edge exceeds the threshold value. Figure 8 shows the normalized instruction count of the benchmark programs for different queue lengths. The baseline is the programs compiled for infinite resources (INFTY), where no clusterization is performed. We selected the queue lengths for the following reasons. The most restrictive configuration for a queue processor is a queue length of 2; this configuration estimates the worst-case conditions for compilation and may strongly affect the quality of the programs. The other three chosen queue lengths (threshold = 4, 8, 16) are values above the average available parallelism in non-optimized SPEC CINT95 programs. The relationship between queue length and available parallelism is that N parallel instructions consume at most 2N queue words; however, peaks of parallelism and some soft edges exceed these values, and our algorithm found opportunities for clusterization. The last queue length is set to infinity to compare the constrained configurations against ideal hardware.

As we expected, the most restrictive configuration, a queue length of 2, incurred the most substantial insertion of spill code, ranging from 3% to 11% more instructions. The clusterization algorithm works on the premise that the width of the data flow graph, or degree of parallelism, must be partitioned whenever its queue requirements exceed the available queue. Therefore, compilation for a queue length of two words forces a large number of partitions of the original data flow graph, and thus a substantial amount of spill code. For queue lengths of 4 and 8, the increase in the number of instructions is about 2% and 1%, respectively. A queue length of 4 words exceeds the average queue requirement of the SPEC CINT95 programs, which is about 3.5 queue words per level. When compiling for a queue length of 16, the insertion of spill code is insignificant for most of the programs; the rare cases that demand more than 16 queue words are bursts of peak parallelism and long soft edges.
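The 2N bound can be made concrete with a small sketch. This is an illustration under the stated assumption that a level's queue demand is the sum of the operand counts of its instructions (at most two for a binary instruction); the function names are hypothetical, not part of the queue compiler.

```python
# Per-level queue requirement of a data-flow graph: N parallel
# instructions dequeue at most 2N operands, so a level's demand is
# bounded by the sum of its instructions' arities.

def queue_demand(levels):
    """levels: list of computation levels; each level is a list of
    operand counts (arities) of the instructions at that level.
    Returns the queue-word demand per level."""
    return [sum(arities) for arities in levels]

def needs_partition(levels, queue_size):
    """Indices of levels whose demand exceeds the available queue;
    these are the levels the clusterizer must partition."""
    return [i for i, d in enumerate(queue_demand(levels)) if d > queue_size]

# A level with three binary instructions demands up to 6 queue words,
# so a 4-word queue forces a partition there:
lv = [[2, 2, 2], [2, 1], [2]]
print(queue_demand(lv))        # [6, 3, 2]
print(needs_partition(lv, 4))  # [0]
```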
[Figure 8 is a bar chart: one group of bars per benchmark (099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex, AVG), one bar per queue length (2, 4, 8, 16, INFTY); the y-axis is the normalized number of instructions, ranging from 0.900 to 1.120.]

Fig. 8. Normalized instruction count measurement for queue lengths, threshold = 2, 4, 8, 16, INFTY.
5.1.2 Spill Code Distribution

We separated the inserted spill code into three components, as shown in Figure 9: parallelism, soft edges, and reloads. Parallelism accounts for all spill instructions inserted to constrain the width of the data flow graph to the available queue. Soft edges represent all spilled temporaries generated to constrain the soft edges that exceed the available queue. Reloads are the instructions that read the spilled temporaries and the uses of shared constants and variables from other clusters. Figure 9 quantifies the contribution of each spill code component to the total number of extra instructions for the 124.m88ksim benchmark. Considering only the two components that contribute spill code (parallelism and soft edges) and ignoring the reload instructions, notice that for a queue size of 2 the parallelism component dominates with 89% of the extra code; the other 11% comes from the soft edge component. As explained above, the parallelism component contributes most of the code, since a large number of partitions must be made for this configuration and these benchmarks. When a larger queue size is imposed, one that exceeds the average queue utilization of the compiled program, the distribution changes: on average, the parallelism component contributes 40%, and the soft edges component contributes the remaining 60% of the spill code.
[Figure 9 is a stacked bar chart: one bar per queue length (2, 4, 8, 16), stacked by spill code component (parallelism, soft edges, reloads); the y-axis is the number of instructions, ranging from 0 to 4,500.]

Fig. 9. Spill code distribution of the 124.m88ksim benchmark.
5.1.3 Critical Path

We define the critical path of a program as the number of computation levels in its data flow graph. These levels represent the true data dependencies of the program and the performance limit of a queue processor: assuming that all instructions in every level are executed in parallel by the queue processor, the execution time is bounded by the number of levels in the program. We use the critical path to estimate the static execution time of the compiled programs. Since partitioning the data flow graph into clusters increases the number of levels, we were interested in determining how the static execution time is affected when compiling for a constrained queue register file. Figure 10 shows the experimental results for different queue sizes. For queue sizes of 4, 8, and 16, the performance degradation of the static execution time is less than 1%
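The critical path and the per-level ILP estimate defined earlier can be computed directly from a data-flow graph. A minimal sketch, assuming the graph is given as an adjacency map from each node to its operands (the representation and names are illustrative, and leaves are counted as nodes, so the ILP figure is only an estimate):

```python
from functools import lru_cache

def critical_path(dag):
    """dag: dict mapping each node to the list of its operand (child)
    nodes. Returns the height of the data-flow graph, i.e. the number
    of queue computation levels on the longest dependency chain."""
    @lru_cache(maxsize=None)
    def depth(n):
        kids = dag[n]
        return 1 + (max(depth(k) for k in kids) if kids else 0)
    return max(depth(n) for n in dag)

def avg_ilp(dag):
    """Instruction level parallelism estimated as the average number
    of nodes per computation level."""
    return len(dag) / critical_path(dag)

# (a + b) * (c + d): leaves at level 1, additions at level 2,
# the multiplication at level 3.
g = {"mul": ["add1", "add2"], "add1": ["a", "b"], "add2": ["c", "d"],
     "a": [], "b": [], "c": [], "d": []}
print(critical_path(g))      # 3
print(round(avg_ilp(g), 2))  # 2.33
```

Partitioning a level into clusters adds levels to such a graph, which is why constrained compilation lengthens the static execution time estimate.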