Software Pipelining in Pegasus/CASH
Cody Hartwig, Elie Krevat
{chartwig,ekrevat}@cs.cmu.edu
Software Pipelining
Software pipelining is a method for increasing the parallelism available to the instruction scheduler
  Data dependencies limit the opportunity for parallel execution
  Software pipelining overlaps loop iterations to increase the number of operations available to schedule between dependencies
Many techniques exist [classification by Allan et al.]
  Kernel recognition (e.g., Aiken & Nicolau)
    Assumes the schedule for each iteration is fixed and the loop is unrolled n times
    Pattern recognition identifies a repeating kernel
  Modulo scheduling
    Analysis of data dependencies (resource/precedence constraints)
    Finds the minimum initiation interval to use when scheduling
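The minimum initiation interval mentioned above can be illustrated with the standard textbook lower bound for modulo scheduling: the larger of a resource-constrained bound and a recurrence-constrained bound. The sketch below is not taken from the CASH compiler; the struct names, fields, and the function min_initiation_interval are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one resource class (e.g., memory ports). */
struct resource_use {
    int ops_per_iteration;  /* operations per iteration that need this resource */
    int units_available;    /* identical units of this resource */
};

/* Hypothetical description of one dependence cycle in the loop. */
struct recurrence {
    int total_latency;      /* sum of latencies around the cycle */
    int iteration_distance; /* number of iterations the cycle spans */
};

/* Integer ceiling of a / b for a >= 0, b > 0. */
static int ceil_div(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* Lower bound on the initiation interval: the maximum over all
 * resource bounds and all recurrence bounds (at least 1 cycle). */
int min_initiation_interval(const struct resource_use *res, size_t nres,
                            const struct recurrence *rec, size_t nrec)
{
    int mii = 1;
    for (size_t i = 0; i < nres; i++) {
        int bound = ceil_div(res[i].ops_per_iteration, res[i].units_available);
        if (bound > mii)
            mii = bound;
    }
    for (size_t i = 0; i < nrec; i++) {
        int bound = ceil_div(rec[i].total_latency, rec[i].iteration_distance);
        if (bound > mii)
            mii = bound;
    }
    return mii;
}
```

A scheduler would typically start at this bound and increase the interval until a valid schedule is found; that search is not shown here.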
Software Pipelining in Pegasus/CASH
Pegasus is an intermediate representation used by the CASH compiler
  The Pegasus graph models control-flow and data-flow
Our Approach: Apply optimizations to the Pegasus graph, not the generated assembly
  Abstracts away resource constraints
  Feedback loop possible after scheduling and register allocation (e.g., to implement less aggressive pipelining because of register spilling)
How Operations are Pipelined
Our approach computes operation outputs for future loop iterations in the current iteration
  Each pipelined operation is copied into the pre-header, and the data-flow for its values before and after executing that operation is fed into the loop hyperblock
  Each loop iteration then uses the operation value already computed and computes the operation value for the next iteration
This approach is analogous to preparing temporary variables for future iterations to make the loop body schedule more efficient
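A source-level analogy of this transformation, written by hand in C, may make the idea concrete. This is only a sketch: the real transformation operates on the Pegasus graph, not on C source, and the function names scale_original and scale_pipelined are hypothetical. Because C has no harmless speculative load, the sketch assumes the array has one extra readable element past the last one processed.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: the load, multiply, and store of each iteration
 * execute in series. */
void scale_original(char *a, size_t n)
{
    assert(a != NULL);
    size_t i = 0;
    while (i < n) {
        char tmp = a[i];
        tmp = (char)(tmp * 2);
        a[i] = tmp;
        i++;
    }
}

/* Hand-pipelined analogue. ASSUMPTION: a has at least n + 1 readable
 * elements, standing in for a speculative load that cannot fault. */
void scale_pipelined(char *a, size_t n)
{
    assert(a != NULL);
    size_t i = 0;
    char loaded = a[0];          /* copy of the load placed in the pre-header */
    while (i < n) {
        char next = a[i + 1];    /* load for the NEXT iteration, issued early */
        a[i] = (char)(loaded * 2); /* use the value computed one iteration ago */
        loaded = next;           /* hand the value to the next iteration */
        i++;
    }
}
```

In the pipelined version the load for iteration i + 1 no longer depends on the store of iteration i, so a scheduler is free to overlap the two.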
Choosing Operations to Pipeline via Pattern Matching
An operation may be pipelined if it matches one of a number of possible patterns
  Patterns depend only on the type of operation and the source of its inputs
  Operation type must allow speculative execution (e.g., loads are ok, but not stores)
Operations on the most expensive paths to etas are the first ones moved
  The most expensive path is not necessarily the longest (e.g., a single ‘load’ operation is more expensive than two ‘add’ operations)
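The distinction between the most expensive path and the longest path can be sketched as a small cost comparison. The latencies below are made-up values chosen only so that a load costs more than two adds; they are not the costs used by CASH, and op_cost, path_cost, and most_expensive_path are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum op_kind { OP_ADD, OP_CAST, OP_MUL, OP_LOAD };

/* ASSUMED latencies in cycles, for illustration only. */
static int op_cost(enum op_kind kind)
{
    switch (kind) {
    case OP_ADD:  return 1;
    case OP_CAST: return 1;
    case OP_MUL:  return 3;
    case OP_LOAD: return 4;
    }
    return 0;
}

/* Cost of a path is the sum of the costs of its operations. */
int path_cost(const enum op_kind *path, size_t len)
{
    int total = 0;
    for (size_t i = 0; i < len; i++)
        total += op_cost(path[i]);
    return total;
}

/* Index of the path with the highest cost (first one wins on ties). */
size_t most_expensive_path(const enum op_kind *const *paths,
                           const size_t *lens, size_t npaths)
{
    assert(npaths > 0);
    size_t best = 0;
    for (size_t i = 1; i < npaths; i++) {
        if (path_cost(paths[i], lens[i]) > path_cost(paths[best], lens[best]))
            best = i;
    }
    return best;
}
```

With these assumed latencies, a path containing a single load (cost 4) is selected ahead of a path of two adds (cost 2), even though the latter has more operations.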
Recognized Patterns
Arithmetic Operation
Load Operation
Cast Operation
As operations are moved, new operations will form the recognized patterns
Example
int i = 0;
char a[100];
while (i < 100) {
    char tmp = a[i];
    tmp = tmp * 2;
    a[i] = tmp;
    i++;
}
The load and store are forced to execute in series
Operations in red are available to move
Evaluation – Moving Average
void move_avg(int *a) {
    int i = 1;
    while (i < 100) {
        int t1 = a[i];
        int t2 = a[i-1];
        a[i] = (t1 + t2) / 2;
        i++;
    }
}
Schedule Length Statistics (after moving 11 operations)
              Before   After
Pre-header         8      14
Loop Body         22      18
Cost of entire function ≈ Cost(Pre-header) + 100*Cost(Loop Body)
Cost before Software Pipelining ≈ 2208
Cost after Software Pipelining ≈ 1814
Software Pipelining improves performance here by ≈ 18%
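The arithmetic behind these figures can be reproduced with a few lines of C. This is only a sketch of the cost model stated above (pre-header cost plus 100 times loop body cost); the function names function_cost and improvement_percent are hypothetical.

```c
#include <assert.h>

/* Approximate cost of the function: the pre-header runs once and the
 * loop body runs once per iteration. */
long function_cost(long pre_header, long loop_body, long iterations)
{
    assert(iterations >= 0);
    return pre_header + iterations * loop_body;
}

/* Relative improvement, in percent, of 'after' over 'before'. */
double improvement_percent(long before, long after)
{
    assert(before > 0);
    return 100.0 * (double)(before - after) / (double)before;
}
```

With the schedule lengths from the table, function_cost(8, 22, 100) gives 2208 and function_cost(14, 18, 100) gives 1814, a reduction of 394, or about 17.8%, which rounds to the 18% quoted above.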