8th Annual Workshop on Duplicating, Deconstructing, and Debunking
http://www.ece.wisc.edu/~wddd
Held in conjunction with the 36th Annual International Symposium on Computer Architecture (ISCA-36)
Austin, Texas, June 21, 2009

Workshop Organizers:
Bryan Black, AMD ([email protected])
Natalie Enright Jerger, University of Toronto ([email protected])
Gabriel Loh, Georgia Tech ([email protected])
The Performance Potential for Single Application Heterogeneous Systems
Henry Wong, University of Toronto ([email protected])
Tor M. Aamodt, University of British Columbia ([email protected])
Eighth Annual Workshop on Duplicating, Deconstructing, and Debunking
Sunday, June 21, 2009
Final Program
Session I: 1:30 – 3:30 PM
The Performance Potential for Single Application Heterogeneous Systems Henry Wong and Tor Aamodt University of Toronto and University of British Columbia
Our Many-Core Benchmarks Do Not Use that Many Cores Paul Bryan, Jesse Beu, Thomas Conte, Paolo Faraboschi and Daniel Ortega Georgia Institute of Technology and HP Labs
Session II: 4:00 – 5:30 PM
Considerations for Mondriaan-like Systems Emmett Witchel University of Texas at Austin
Is Transactional Memory Programming Actually Easier? Christopher Rossbach and Owen Hofmann and Emmett Witchel University of Texas at Austin
A consideration of Amdahl’s Law [9] suggests a single-chip multiprocessor with asymmetric cores is a promising way to improve performance [16]. In this paper, we conduct a limit study of the potential benefit of the tighter integration of a fast sequential core designed for instruction level parallelism (e.g., an out-of-order superscalar) and a large number of smaller cores designed for thread-level parallelism (e.g., a graphics processor). We optimally schedule instructions across cores under assumptions used in past ILP limit studies. We measure sensitivity to the sequential performance (instruction read-after-write latency) of the low-cost parallel cores, and to the latency and bandwidth of the communication channel between these cores and the fast sequential core. We find that the potential speedup of traditional “general purpose” applications (e.g., those from SpecCPU) as well as a heterogeneous workload (game physics) on a CPU+GPU system is low (2.2× to 12.7×), due to poor sequential performance of the parallel cores. Communication latency and bandwidth have comparatively small performance impact (1.07× to 1.48×), calling into question whether integrating an array of small parallel cores and a larger core onto one chip will, in practice, significantly benefit the performance of these workloads compared to a system using two separate specialized chips.
1 Introduction
As the number of cores integrated on a single chip continues to increase, the question of how useful additional cores will be is of intense interest. Recently, Hill and Marty [16] combined Amdahl’s Law [9] and Pollack’s Rule [28] to quantify the notion that single-chip asymmetric multicore processors may provide better performance than using the same silicon area for a single core or some number of identical cores. In this paper we take a step towards refining this analysis by considering real workloads and their behavior scheduled on an idealized machine while modeling communication latency and bandwidth limits.

∗Work done while the first author was at the University of British Columbia.
Heterogeneous systems typically use a traditional microprocessor core optimized for extracting instruction level parallelism (ILP) for serial tasks, while offloading parallel sections of algorithms to an array of smaller cores to efficiently exploit available data and/or thread level parallelism. The Cell processor [10] is a heterogeneous multicore system, where a traditional PowerPC core resides on the same die as an array of eight smaller cores. Existing GPU compute systems [22, 2] typically consist of a CPU with a discrete GPU attached via a card on a PCI Express bus. Although development of CPU-GPU single-chip systems has been announced [1], there is little published information quantifying the benefits of such integration.
One common characteristic of heterogeneous multicore systems employing GPUs is that the small multicores for exploiting parallelism are unable to execute a single thread of execution as fast as the larger sequential processor in the system. For example, recent GPUs from NVIDIA have a register to register read-after-write latency equivalent to 24 shader clock cycles [25]¹. This latency is due in part to the use of fine grained multithreading [32] to hide memory access and arithmetic logic unit latency [13]. Our limit study is designed to capture this effect.
While there have been previous limit studies on parallelism in the context of single-threaded machines [7, 17, 15], and homogeneous multicore machines [21], a heterogeneous system presents a different set of trade-offs. It is no longer merely a question of how much parallelism can be extracted, but also whether the parallelism is sufficient considering the lower sequential performance (higher register read-after-write latency) and communication overheads between processors. Furthermore, applications with sufficient thread-level parallelism to hide communication latencies may diminish the need for a single-chip heterogeneous system, except where system cost considerations limit total silicon area to that available on a single chip.

¹The CUDA programming manual indicates 192 threads are required to hide read-after-write latency within a single thread; there are 32 threads per warp, and each warp is issued over four clock cycles.
This paper makes the following contributions:
• We perform a limit study of an optimistic heterogeneous system consisting of a sequential processor and a parallel processor, modeling a traditional CPU and an array of simpler cores for exploiting parallelism. We use a dynamic programming algorithm to choose points along the instruction trace where mode switches should occur such that the total runtime of the trace, including the penalties incurred for switching modes, is minimized.
• We show the parallel processor array’s sequential performance (read-after-write latency) relative to the performance of the sequential processor (CPU core) is a significant limitation on achievable speedup for a set of general-purpose applications. Note this is not the same as saying performance is limited by the serial portion of the computation [9].
• We find that latency and bandwidth between the two processors have comparatively minor effects on speedup.
In the case of a heterogeneous system using a GPU-like parallel processor, speedup is limited to only 12.7× for SPECfp 2000, 2.2× for SPECint 2000, and 2.5× for PhysicsBench [31]. When connecting the GPU using an off-chip PCI Express-like bus, SPECfp achieves 74%, SPECint 94%, and PhysicsBench 82% of the speedup achievable without latency and bandwidth limitations.
We present our processor model in Section 2, methodology in Section 3, analyze our results in Section 4, review previous limit studies in Section 5, and conclude in Section 6.
2 Modeling a Heterogeneous System
We model heterogeneous systems as having two processors with different characteristics (Figure 1). The sequential processor models a traditional processor core optimized for ILP, while the parallel processor models an array of cores for exploiting thread-level parallelism. The parallel processor models an array of low-cost cores by allowing parallelism, but with a longer register read-after-write latency than the sequential processor. The two processors may communicate over a communication channel whose latency is high and bandwidth is limited when the two processors are on separate chips. We assume that the processors are attached to ideal memory systems. Specifically, for dependencies between instructions within a given core (sequential or parallel) we assume store-to-load communication has the same latency as communication via registers (register read-after-write latency) on the same core. Thus, the effects of long latency memory access for the parallel core (assuming GPU-like fine-grained multi-threading to tolerate cache misses) are captured in the long read-to-write delay. The effects of caches and prefetching on the sequential processor core are captured by its relatively short read-to-write delay. We model a single-chip system (Figure 1(a)) with shared memory by only considering synchronization latency, potentially accomplished via shared memory (➊) and on-chip coherent caches [33]. We model a system with private memory (Figure 1(b)) by limiting the communication channel’s bandwidth and imposing a latency when data needs to be copied across the link (➋) between the two processors.

[Figure 1. Conceptual Model of a Heterogeneous System. Two processors with different characteristics (a) may, or (b) may not, share memory.]
Sections 2.1 and 2.2 describe each portion of our model in more detail. In Section 2.3 we describe our algorithm for partitioning and scheduling an instruction trace to optimize its runtime on the sequential and parallel processors.
2.1 Sequential Processor
We model the sequential processor as being able to execute one instruction per cycle (a CPI of one). This simple model has the advantage of having predictable performance characteristics that make the optimal scheduling (Section 2.3) of work between sequential and parallel processors feasible. It preserves the essential characteristic of high-ILP processors that a program is executed serially, while avoiding the modeling complexity of a more detailed model. Although this simple model does not capture the CPI effects of a sequential processor which exploits ILP, we are mainly interested in the relative speeds between the sequential and parallel processors. We account for sequential processor performance due to ILP by making the parallel processor relatively slower. In the remainder of this paper, all time periods are expressed in terms of the sequential processor’s cycle time.
2.2 Parallel Processor
We model the parallel processor as a dataflow processor, where a data dependency takes multiple cycles to resolve. This dataflow model is driven by our trace-based limit study methodology described in Section 3.2, which assumes perfectly predicted branches to uncover parallelism. Using a dataflow model, we avoid the requirement of partitioning instructions into threads, as done in the thread-programming model. This allows us to model the upper bound of parallelism for future programming models that may be more flexible than threads.
The parallel processor can execute multiple instructions in parallel, provided data dependencies are satisfied. Slower sequential performance of the parallel processor is modeled by increasing the latency from the beginning of an instruction’s execution until the time its result is available for a dependent instruction. We do not limit the parallelism that can be used by the program, as we are interested in the amount of parallelism available in algorithms.
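The dataflow model above can be sketched in a few lines. This is not the authors' code, only a minimal illustration under the stated assumptions: an instruction may begin once all of its producers' results are ready, and its own result becomes ready a fixed read-after-write latency later, with unlimited parallelism.

```python
# Sketch (not the authors' code): runtime of a trace on the dataflow-style
# parallel processor. deps[i] lists the earlier instructions whose results
# instruction i consumes; parallelism itself is unlimited.

def parallel_runtime(deps, raw_latency):
    """Returns the trace's total runtime on the parallel processor."""
    ready = []  # cycle at which each instruction's result is available
    for producers in deps:
        start = max((ready[p] for p in producers), default=0)
        ready.append(start + raw_latency)
    return max(ready, default=0)

# A chain of 4 dependent instructions plus one independent instruction:
# runtime is set by the 4-deep dependency chain, not the instruction count.
deps = [[], [0], [1], [2], []]
assert parallel_runtime(deps, raw_latency=100) == 400
```

Runtime is the dependency critical path times the read-after-write latency, which is why a long-latency parallel processor only pays off when a region has parallelism well above that latency.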
Our model can represent a variety of methods of building parallel hardware. In addition to an array of single-threaded cores, it can also model cores using fine-grain multithreading, like current GPUs. Note that modern GPUs from AMD [3] and NVIDIA [23, 24] provide a scalar multithreaded programming abstraction even though the underlying hardware is single-instruction, multiple data (SIMD). This execution mode has been called single-instruction, multiple thread (SIMT) [14].
In GPUs, fine-grain multithreading creates the illusion of a large amount of parallelism (>10,000s of threads) with low per-thread performance, although physically there is a lower amount of parallelism (100s of operations per cycle), high utilization of the ALUs, and frequent thread switching. GPUs use the large number of threads to “hide” register read-after-write latencies and memory access latencies by switching to a ready thread. From the perspective of the algorithm, a GPU appears as a highly-parallel, low-sequential-performance parallel processor.
To model current GPUs, we use a register read-after-write latency of 100 cycles. For example, current NVIDIA GPUs have a read-after-write latency of 24 shader clocks [25] and a shader clock frequency of 1.3-1.5 GHz [23, 24]. The 100 cycle estimate includes the effect of instruction latency (24×), the difference between the shader clock and current CPU clock speeds (about 2×), and the ability of current CPUs to extract ILP (we assume an average IPC of 2 on current CPUs, resulting in another factor of 2×). We ignore factors such as SIMT branch divergence [8].
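The 100-cycle figure follows directly from the three factors named in the text; as a back-of-the-envelope check (variable names ours, factors as stated in the paper):

```python
# Reproducing the paper's ~100-cycle estimate of GPU-like parallel
# instruction latency, in sequential-processor cycles.
shader_raw_latency = 24   # register read-after-write, in shader clocks [25]
clock_ratio = 2           # approx. CPU clock vs. 1.3-1.5 GHz shader clock
cpu_ipc = 2               # assumed average ILP of the sequential core

effective_latency = shader_raw_latency * clock_ratio * cpu_ipc
assert effective_latency == 96   # rounded up to 100 cycles in the model
```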
We note that the SPE cores on the Cell processor have comparable read-after-write latency to the more general purpose PPE core. However, the SPE cores are not optimized for control-flow intensive code [10] and thus may potentially suffer a higher “effective” read-after-write latency on some general purpose code (although quantifying such effects is beyond the scope of this work).
2.3 Heterogeneity
We model a heterogeneous system by allowing an algorithm to choose between executing on the sequential processor or the parallel processor and to switch between them (which we refer to as a “mode switch”). We do not allow concurrent execution on both processors. This is a common paradigm, where a parallel section of work is spawned off to a co-processor while the main processor waits for the results. The runtime difference for optimal concurrent processing (e.g., as in the asymmetric multicore chips analysis given by Hill and Marty [16]) is no better than 2× compared to not allowing concurrency.
We schedule an instruction trace for alternating execution on the two processors. Execution of a trace on each type of core was described in Sections 2.1 and 2.2. For each mode switch, we impose a “mode switch cost”, intuitively modeling synchronization time during which no useful work is performed. The mode switch cost is used to model communication latency and bandwidth as described in Sections 2.3.2 and 2.3.3, respectively. Next we describe our scheduling algorithm in more detail.
2.3.1 Scheduling Algorithm
Dynamic programming is often applied to find optimal solutions to optimization problems. The paradigm requires that an optimal solution to a problem be recursively decomposable into optimal solutions of smaller sub-problems, with the solutions to the sub-problems computed first and saved in a table to avoid re-computation [6].
In our dynamic programming algorithm, we aim to compute the set of mode switches (i.e., the scheduling) of the given instruction trace that will minimize execution time, given the constraints of our model. We decompose the optimal solution for the whole trace into sub-problems that are optimal solutions to shorter traces with the same beginning, with the ultimate sub-problem being the trace containing only the first instruction, which can be trivially scheduled. We recursively define a solution to a longer trace by observing that a solution to a long trace is composed of a solution to a shorter sub-trace, followed by a decision on whether to perform a mode switch, followed by execution of the remaining instructions in the chosen mode.
The dynamic programming algorithm keeps an N×2 state table when given an input trace of N instructions. Each entry in the state table records the cost of an optimal scheduling for every sub-trace (N of them) and the mode that was last used in those sub-traces (2 modes). At each step of the algorithm, a solution for the next sub-trace requires examining all possible locations of the previous mode switch to find the one that gives the best schedule. For each possible mode switch location, the corresponding entry of the state table is examined to retrieve the optimal solution for the sub-trace that executes all instructions up to that entry in the corresponding state (execution on the sequential or the parallel core, respectively). This value is used to compute a candidate state table entry for the current step by adding the mode switch cost (if switching modes) and the cost to execute the remaining section of the trace from the candidate switch point up to the current instruction in the current mode (sequential or parallel). The lowest cost candidate over all earlier sub-traces is chosen for the current sub-trace.
The naive optimal algorithm described above runs in quadratic time with respect to the instruction trace length. For traces millions of instructions long, quadratic time is too slow. We make an approximation to enable the algorithm to run in time linear in the length of the instruction trace: instead of looking back at all past instructions for each potential mode switch point, we only look back 30,000 instructions. The modified algorithm is no longer optimal. We mitigate this sub-optimality by reordering instructions before scheduling. We observed that the amount of sub-optimality introduced by this approach is insignificant.
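The recurrence and the bounded-lookback approximation can be sketched as follows. This is a simplified illustration, not the authors' implementation: the hypothetical `seg_cost[m](j, i)` stands in for the runtime of instructions [j, i) in mode m (0 = sequential, 1 = parallel), which the paper derives from the models of Sections 2.1 and 2.2.

```python
# Sketch of the Section 2.3.1 dynamic program with a bounded lookback window.

def schedule(n, seg_cost, switch_cost, window=30000):
    INF = float("inf")
    # best[i][m]: minimal runtime of the first i instructions, ending in mode m
    best = [[0.0, 0.0]] + [[INF, INF] for _ in range(n)]
    for i in range(1, n + 1):
        for m in (0, 1):                              # mode of the last segment
            for j in range(max(0, i - window), i):    # last mode-switch point
                for prev in (0, 1):
                    cost = (best[j][prev]
                            + (switch_cost if j > 0 and prev != m else 0.0)
                            + seg_cost[m](j, i))
                    if cost < best[i][m]:
                        best[i][m] = cost
    return min(best[n])

# Toy trace of 6 instructions: sequential mode costs 1 cycle/instruction,
# parallel mode 0.25, and a mode switch costs 3 cycles, so the best
# schedule runs everything on the parallel processor (runtime 1.5).
costs = {0: lambda j, i: (i - j) * 1.0, 1: lambda j, i: (i - j) * 0.25}
assert abs(schedule(6, costs, switch_cost=3.0) - 1.5) < 1e-9
```

With the window fixed at 30,000 instructions, the loop nest runs in time linear in the trace length, matching the paper's complexity claim.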
To overcome the limitation of looking back only 30,000 instructions in our algorithm, we reorder instructions in dataflow order before scheduling. Dataflow order is the order in which instructions would execute if scheduled with our optimal scheduling algorithm. This linear-time preprocessing step exposes parallelism found anywhere in the instruction trace by grouping together instructions that can execute in parallel.
We remove instructions from the trace that do not depend on the result of any other instruction. Most of these instructions are dead code created by our method of exposing loop- and function-level parallelism, described in Section 3.2. Since dead code can execute in parallel, we remove these instructions to avoid having them inflate the amount of parallelism we observe. Across our benchmark set, 27% of instructions are removed by this mechanism. Note that the dead code we are removing is not necessarily dynamically dead [5], but rather overhead related to sequential execution of parallel code. The large number of instructions removed results from, for example, the expansion of x86 push and pop instructions (for register spills/fills) into a load or store micro-op (which we keep) and a stack-pointer update micro-op (which we do not keep).
2.3.2 Latency
We model the latency of migrating tasks between processors by imposing a constant runtime cost for each mode switch. This cost is intended to model the latency of spawning a task, as well as the transfer of data between the processors. If the amount of data transferred is large relative to the bandwidth of the link between processors, this is not a good model for the cost of a mode switch. This model is reasonable when the mode switch is dominated by latency, for example in a heterogeneous multicore system where the memory hierarchy is shared (Figure 1(a)), so very little data needs to be copied between the processors.
As described in Section 2.3, our scheduling algorithm considers the cost of mode switches. A mode switch cost of zero would allow freely switching between modes, while a very high cost would constrain the scheduler to run the entire trace on one processor or the other, whichever is faster.
2.3.3 Bandwidth
Bandwidth is a constraint that limits the rate at which data can be transferred between processors in our model. Note that this does not apply to each processor’s link to its own memory (Figure 1), which we assume to be unconstrained. In our shared-memory model (Figure 1(a)), mode switches do not need to copy large amounts of data, so only latency (Section 2.3.2) is a relevant constraint. In our private-memory model (Figure 1(b)), bandwidth is consumed on the link connecting the processors as a result of a mode switch.
If a data value is produced by an instruction on one processor and consumed by one or more instructions on the other processor, then that data value needs to be communicated to the other processor. A consequence of exceeding the imposed bandwidth limitation is the addition of idle computation cycles while an instruction waits for its required operand to be transferred. In our model, we assume opportunistic use of bandwidth, allowing communication of a value as soon as it is ready, in parallel with computation.
Each data value to be transferred is sent sequentially and occupies the communication channel for a specific amount of time. Data values can be sent any time after the instruction producing the value executes, but must arrive before the first instruction that consumes the value is executed. Data transfers are scheduled onto the communication channel using an “earliest deadline first” algorithm, which produces a schedule with a minimum of added idle cycles.
Bandwidth constraints are applied by changing the amount of time each data value occupies the communication channel. Communication latency is applied by setting the deadline for a value some number of cycles after the value is produced.
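The earliest-deadline-first transfer scheduling can be sketched as below. This is our simplified illustration, not the authors' code: each transfer is a (ready, deadline, duration) triple, and we charge an idle-cycle penalty whenever the channel cannot meet a deadline, without feeding the stall back into later instruction timing as the full model would.

```python
# Sketch: EDF scheduling of cross-processor value transfers onto the
# communication channel (Section 2.3.3). A transfer may start once its
# producer has executed (ready) and should finish before its first
# consumer executes (deadline); duration reflects the bandwidth limit.

def edf_idle_cycles(transfers):
    free = 0   # cycle at which the channel next becomes free
    idle = 0   # computation cycles lost waiting for late transfers
    for ready, deadline, duration in sorted(transfers, key=lambda t: t[1]):
        start = max(free, ready)
        finish = start + duration
        idle += max(0, finish - deadline)
        free = finish
    return idle

# Two values contending for the channel: the second finishes 5 cycles
# past its deadline, so its consumer stalls for 5 cycles.
assert edf_idle_cycles([(0, 10, 10), (0, 15, 10)]) == 5
```

Raising the bandwidth limit shrinks each `duration`, and adding link latency moves each `deadline` later relative to `ready`, which is exactly how the two constraints enter the model.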
Computing the bandwidth requirements and idle cycles needed, and thus the cost to switch modes, requires a schedule of the instruction trace, but the optimal instruction trace schedule is in turn affected by the cost of switching modes. We approximate the ideal behavior iteratively: we perform scheduling using a constant mode switch overhead for each mode switch, update the average penalty due to bandwidth consumption across all mode switches, and then use the new estimate of the average switch cost as input to the scheduling algorithm, repeating until convergence.
3 Simulation Infrastructure
We evaluate performance using micro-op traces extracted from execution of a set of x86-64 benchmarks on the PTLsim [18] simulator. Each micro-op trace was then scheduled using our scheduling algorithm for execution on the heterogeneous system.
3.1 Benchmark Set
We chose our benchmarks with a focus towards general-purpose computing. We used the reference workloads for SPECint and SPECfp 2000 v1.3.1 (23 benchmarks, except 253.perlbmk and 255.vortex, which did not run in our simulation environment), PhysicsBench 2.0 [31] (8 benchmarks), SimpleScalar 3.0 [29] (used here as a benchmark), and four small microbenchmarks (described in Table 1).

Table 1. Microbenchmarks

  Benchmark   Description
  linear      Compute average of 9 input pixels for each output pixel. Each pixel is independent.
  sepia       3×3 constant matrix multiply on each pixel's 3 components. Each pixel is independent.
  serial      A long chain of dependent instructions; has parallelism of approximately 1 (no parallelism).
  twophase    Loops through two alternating phases, one with no parallelism, one with high parallelism. Needs to switch between processor types for high speedup.
We chose PhysicsBench because it contains both sequential and parallel phases, and would be a likely candidate to benefit from heterogeneity, as it would be unsatisfactory if both types of phases were constrained to one processor type [31].
Our SimpleScalar benchmark used the out-of-order processor simulator from SimpleScalar/PISA, running go from SPECint 95, compiled for PISA.
We used four microbenchmarks to observe behavior at extremes of parallelism, as shown in Table 1. Linear and sepia are highly parallel, serial is serial, and twophase has alternating highly parallel and serial phases.
Figure 2 shows the average parallelism present in our benchmark set. As expected, SPECfp has more parallelism (611) than SPECint (116) and PhysicsBench (83). Linear (4790) and sepia (6815) have the highest parallelism, while serial has essentially no parallelism.
3.2 Traces
Micro-op traces were collected from PTLsim running x86-64 benchmarks, compiled with gcc 4.1.2 -O2. The four microbenchmarks were run in their entirety, while the 32 real benchmarks were run through SimPoint [30] to choose representative sub-traces to analyze. Our traces are captured at the micro-op level, so in this paper instruction and micro-op are used interchangeably.
We used SimPoint to select simulation points of 10-million micro-ops in length from complete runs of the benchmarks. As recommended [30], we allowed SimPoint to decide how many simulation points should be used to approximate the entire benchmark run. We averaged 12.9 simulation points per benchmark. This is a significant savings over the complete benchmarks, which were typically several hundred billion instructions long. The weighted average of the results over each set of SimPoint traces is presented for each benchmark.

[Figure 2. Average Parallelism of Our Benchmark Set (parallelism, 1 to 10,000, per benchmark group).]
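The per-benchmark numbers reported later are combined this way; as a minimal sketch (function and data hypothetical), each simulation point carries a weight giving the fraction of the full run it represents:

```python
# Sketch: combining per-SimPoint results into a whole-benchmark estimate.
def weighted_result(points):
    """points: list of (weight, value) pairs whose weights sum to 1."""
    return sum(w * v for w, v in points)

# Hypothetical benchmark measured at three simulation points:
r = weighted_result([(0.5, 2.0), (0.3, 4.0), (0.2, 10.0)])
assert abs(r - 4.2) < 1e-12
```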
We assume branches are correctly predicted. Many branches, like loops, can often be easily predicted, speculated, or even restructured away during manual parallelization. As we are trying to evaluate the upper bound of parallelism in an algorithm, we avoid limiting parallelism by not imposing the branch-handling characteristics of sequential machines. This is somewhat optimistic, as true data-dependent branches would at least need to be converted into speculation or predicated instructions.
Each trace is analyzed for true data dependencies. Register dependencies are recognized if an instruction consumes a value produced by an earlier instruction (read-after-write). Dependencies on values carried by the instruction pointer register are ignored, to avoid dependencies due to instruction-pointer-relative data addressing. Like earlier limit studies [17, 15], stack pointer register manipulations are ignored, to extract parallelism across function calls. Memory disambiguation is perfect: dependencies are carried through memory only if an instruction loads a value from memory actually written by an earlier instruction.
It is also important to be able to extract loop-level parallelism and avoid serialization of loops through the loop induction variable. We implemented a generic solution to prevent this type of serialization. We identify instructions that produce result values that are statically known, which are instructions that have no input operands (e.g., load constant). We then repeatedly look for instructions dependent only on values that are statically known and mark the values they produce as statically known as well. We then remove dependencies on all statically-known values. This is similar to repeatedly applying constant folding and constant propagation optimizations [20] to the instruction trace. The dead code that results is removed as described in Section 2.3.
A loop induction variable [20] is often initialized with a constant (e.g., 0). Incrementing the induction variable by a constant depends only on the initialization value of the induction variable, so the incremented value is also statically known. Each subsequent increment is likewise statically known. This removes serialization caused by the loop control variable, but preserves genuine data dependencies between loop iterations, including loop induction variable updates that depend on a variable computed value.
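The fixed-point propagation of "statically known" values described above can be sketched as follows (our illustration, with a toy value-id representation of a trace; the real analysis operates on micro-ops):

```python
# Sketch (Section 3.2): repeatedly propagate constant-ness through the
# trace so that constant-stepped induction chains stop serializing loops.
# Each trace entry is (input value ids, produced value id).

def statically_known(trace):
    """Returns the set of value ids derivable from constants alone."""
    known = set()
    changed = True
    while changed:            # iterate to a fixed point
        changed = False
        for inputs, output in trace:
            if output not in known and all(v in known for v in inputs):
                known.add(output)
                changed = True
    return known

# i0 = 0; i1 = i0 + 1; i2 = i1 + 1   -- induction chain: all static
# x = load(mem); y = x + i2          -- depends on loaded data: not static
trace = [([], "i0"), (["i0"], "i1"), (["i1"], "i2"),
         (["mem"], "x"), (["x", "i2"], "y")]
assert statically_known(trace) == {"i0", "i1", "i2"}
```

Dependencies on every value in the returned set are then dropped, so successive iterations of a counted loop can be scheduled in parallel while the genuine dependency through `x` is preserved.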
4 Results
In this section, we present our analysis of our experimental results. First, we look at the speedup that can be achieved when adding a parallel co-processor to a sequential machine, and show that the speedup is highly dependent on the parallel instruction latency. We define parallel instruction latency as the read-after-write latency of the parallel cores expressed in sequential-processor cycles (recall we assume a CPI of one for the sequential core). We then look at the effect of communication latency and bandwidth as parallel instruction latency is varied, and see that the effect is noticeable but small.
[Figure 3. Proportion of Instructions Scheduled on Parallel Core. Real benchmarks (a), microbenchmarks (b). Both panels plot the fraction of instructions on the parallel processor (0-100%) against parallel instruction latency (1 to 100,000); panel (a) shows SPECfp, SPECint, SimpleScalar, PhysicsBench, and their average, and panel (b) shows the four microbenchmarks.]
4.1 Why Heterogeneous?
Figures 3 and 4 give some intuition for the characteristics of the scheduling algorithm. Figure 4 shows the parallelism of the instructions that are scheduled to use the parallel processor when our workloads are scheduled for best performance. Figure 3(a) shows the proportion of instructions that are assigned to execute on the parallel processor. As the instruction latency increases, sections of the workload where the benefit of parallelism does not outweigh the cost of slower sequential performance become scheduled onto the sequential processor, raising the average parallelism of those portions that remain on the parallel processor, while reducing the proportion of instructions that are scheduled on the parallel processor. The instructions that are scheduled to run on the sequential processor receive no speedup, but scheduling more instructions on the parallel processor in an attempt to increase parallelism would only decrease speedup.
The microbenchmarks in Figure 3(b) show ourscheduling algorithm works as expected. Serial hasnearly no instructions scheduled for the parallel core.
[Figure 4. Parallelism on Parallel Processor: average parallelism in parallel phases (10 to 100,000) against parallel instruction latency (1 to 100,000), for SPECfp, SPECint, SimpleScalar, PhysicsBench, and their average.]
Twophase has about 18.5% of its instructions, those in its serial component, scheduled on the sequential processor, leaving 81.5% on the parallel processor, while sepia and linear strongly prefer the parallel processor.
We now look at the potential speedup of adding a parallel processor to an existing sequential machine. Figures 5(a) and (b) show the speedup of our benchmarks for varying parallel instruction latency, as a speedup over a single sequential processor. Two plots for each benchmark group are shown: the solid plots show the speedup of a heterogeneous system where communication has no cost, while the dashed plots show speedup when communication is very expensive. We focus on the solid plots in this section.
It can be observed from Figures 5(a) and (b) that as the instruction latency increases, there is a significant loss in the potential speedup provided by the extra parallel processor, which becomes limited by the amount of parallelism available in the workload that can be extracted, as seen in Figure 3. Since our parallel processor model is somewhat optimistic, the speedups shown here should be regarded as an upper bound of what can be achieved.
With a parallel processor with a GPU-like instruction latency of 100 cycles, SPECint would be limited to a speedup of 2.2×, SPECfp to 12.7×, and PhysicsBench to 2.5×, with 64%, 92%, and 72% of instructions scheduled on the parallel processor, respectively. The speedup is much lower than the peak relative throughput of a GPU compared to a sequential CPU (≈ 50×), which shows that if a GPU-like processor were used as the parallel processor in a heterogeneous system, the speedup on these workloads would be limited by the parallelism available in the workload, while still leaving much of the GPU hardware idle.
In contrast, for highly-parallel workloads, thespeedups achieved at an instruction latency of 100 are
Figure 5. Speedup of Heterogeneous System: (a) real benchmarks (SPECint, SPECfp, SimpleScalar, PhysBench), (b) microbenchmarks (sepia, linear, twophase, serial). Ideal communication (solid), communication forbidden (dashed, NoSwitch). Speedup vs. parallel instruction latency.
similar to the peak throughput available in a GPU. The highly-parallel linear filter and sepia tone filter kernels (Figure 5(b)) have enough parallelism to achieve 50-70× speedup at an instruction latency of 100. A highly-serial workload (serial) does not benefit from the parallel processor.
Although current GPU compute solutions, built with efficient low-complexity multi-threaded cores, are sufficient to accelerate algorithms with large amounts of thread-level parallelism, general-purpose algorithms are unable to fill the large number of thread contexts a GPU provides, leaving the available arithmetic hardware under-utilized.
4.2 Communication
In this section, we evaluate the impact of communication latency and bandwidth on the potential speedup, comparing performance between the extreme cases where communication is unrestricted and where communication is forbidden. The solid plots in Figure 5 show
Figure 6. Slowdown of infinite communication cost (NoSwitch) compared to zero communication cost: (a) real benchmarks, (b) microbenchmarks. Slowdown vs. parallel instruction latency.
speedup when there are no limitations on communication, while the dashed plots (marked NoSwitch) correspond to communication so expensive that the scheduler chooses to run the workload entirely on the sequential processor or the parallel processor, never switching between them. Figures 6(a) and (b) show the ratio between the solid and dashed plots in Figures 5(a) and (b), respectively, to highlight the impact of communication. At both extremes of instruction latency, where the workload is mostly sequential or mostly parallel, communication has little impact. It is in the moderate range around 100-200 where communication potentially matters most.
The potential impact of expensive (in latency and bandwidth) communication is significant. For example, at a GPU-like instruction latency of 100, SPECint achieves only 56%, SPECfp 23%, and PhysicsBench 44% of the performance of no communication, as can be seen in Figure 6(a). From our microbenchmark set (Figures 5(b) and 6(b)), twophase is particularly sensitive to communication costs, and gets no speedup for instruction latencies above 10. We look at more realistic
Figure 7. Slowdown due to 100,000 cycles of mode-switch latency. Real benchmarks. Slowdown factor (vs. zero-cost switch) vs. parallel instruction latency.
constraints on latency and bandwidth in the following sections.
4.2.1 Latency
Poor parallel performance is often attributed to high communication latency [21]. Heterogeneous processing adds a new communication requirement: the communication channel between the sequential and parallel processors (Figure 1). In this section, we measure the impact of the latency of this communication channel.
We model this latency by requiring that switching modes between the two processor types incurs a fixed amount of idle time. In this section, we do not consider the bandwidth of the data that needs to be transferred. This model represents a heterogeneous system with shared memory (Figure 1(a)), where migrating a task does not involve data copying, but only a pipeline flush, notification of the other processor, and potentially flushing private caches if the caches are not coherent.
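To illustrate why a fixed switch cost matters so little when phases are long, consider a toy model (with made-up phase lengths; this is not the paper's optimal scheduler, just a greedy sketch of the same trade-off):

```python
def runtime(phases, switch_cost):
    """Total cycles for a workload of phases, where each phase is a pair
    (cycles_on_sequential, cycles_on_parallel). Greedily switch to the
    faster processor only when the saving beats the fixed switch cost."""
    total, current = 0, 'seq'
    for seq_c, par_c in phases:
        best = 'seq' if seq_c <= par_c else 'par'
        stay = seq_c if current == 'seq' else par_c
        move = (seq_c if best == 'seq' else par_c) + switch_cost
        if best != current and move < stay:
            total += move
            current = best
        else:
            total += stay
    return total

# Alternating sequential-friendly and parallel-friendly phases,
# each tens of millions of cycles long (hypothetical numbers).
phases = [(10_000_000, 100_000_000), (80_000_000, 2_000_000)] * 4
```

With phases this long, even a 100,000-cycle switch cost adds only a percent or two to the zero-cost runtime, consistent with the small slowdowns in Figure 7.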
Figure 7 shows the slowdown when we include 100,000 cycles of mode-switch latency in our performance model and scheduling, compared to a zero-latency mode switch.
Imposing a delay for every mode switch has only a minor effect on runtime. Although Figure 6(a) suggested that the potential for performance loss due to latency is great, even when each mode switch costs 100,000 cycles (greater than 10 µs at current clock rates), most of the speedup remains: we can achieve ≈85% of the performance of a heterogeneous system with zero-cost communication. Stated another way, reducing latency between the sequential and parallel cores might provide an average ≈18% performance improvement.
To gain further insight into the impact of mode
Figure 8. Mode switches per 10M instructions as switch latency varies: (a) zero cycles, (b) 10 cycles, (c) 1000 cycles. Real benchmarks (SPECint, SimpleScalar, PhysBench, SPECfp) vs. parallel instruction latency.
switch latency, Figure 8 illustrates the number of mode switches per 10 million instructions as we vary the cost of switching from zero to 1000 cycles. As the cost of a mode switch increases, the number of mode switches decreases. Also, more mode switches occur at intermediate values of parallel instruction latency, where the benefit of being able to use both types of processors outweighs the cost of switching modes.
For systems with private memory (e.g., a discrete GPU), data copying is required when migrating a task between processors at mode switches. We consider bandwidth constraints in the next section.
4.2.2 Bandwidth
In the previous section, we saw that high communication latency had only a minor effect on achievable performance. Here, we place a bandwidth constraint on the communication between processors. Data that needs to be communicated between processors is restricted to a maximum rate, and the processors are forced to wait if data is not available in time for an instruction to use it, as described in Section 2.3.3. We also include 1,000 cycles of latency as part of the model.
We first construct a model to represent PCI Express,
Figure 9. Slowdown due to a bandwidth constraint of 8 cycles per 32-bit value and 1,000 cycles latency, similar to PCI Express x16. Real benchmarks. Slowdown vs. parallel instruction latency.
Figure 10. Speedup over the sequential processor for varying bandwidth constraints, with reference points for PCI Express x16, AGP 4x, and PCI. Real benchmarks.
as discrete GPUs are often attached to the system this way. PCI Express x16 has a peak bandwidth of 4 GB/s and a latency around 250 ns [4]. Assuming current processors perform about 4 billion instructions per second on 32-bit data values, we can model PCI Express using a latency of about 1,000 cycles and a bandwidth of 4 cycles per 32-bit value. Being somewhat pessimistic to account for overheads, we use a bandwidth of 8 cycles per 32-bit value (about 2 GB/s).
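These model parameters follow from back-of-the-envelope arithmetic; a quick check, assuming the ~4 GHz clock implied by "4 billion instructions per second":

```python
CLOCK_HZ = 4e9          # assumed: ~4 billion cycles (instructions) per second
BYTES_PER_VALUE = 4     # 32-bit data values

def cycles_per_value(bandwidth_bytes_per_s):
    """Cycles between successive 32-bit transfers at a given link bandwidth."""
    values_per_s = bandwidth_bytes_per_s / BYTES_PER_VALUE
    return CLOCK_HZ / values_per_s

def latency_cycles(latency_s):
    """Link latency expressed in processor cycles."""
    return CLOCK_HZ * latency_s

# PCI Express x16: 4 GB/s peak and ~250 ns latency give the model's
# 4 cycles per value and ~1,000 cycles; halving bandwidth to 2 GB/s
# gives the pessimistic 8 cycles per value.
```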
Figure 9 shows the performance impact of restricting bandwidth to one 32-bit value every 8 clocks with 1,000 cycles of latency. The slowdown is worse than with 100,000 cycles of latency, but the benchmark set affected the most (SPECfp) can still achieve ≈67.4% of the ideal performance at a parallel instruction latency of 100. Stated another way, increasing bandwidth between the sequential and parallel cores might provide an
average 1.48× performance improvement for workloads like SPECfp. For workloads such as PhysicsBench and SPECint the potential benefits appear lower (1.33× and 1.07× potential speedup, respectively). Comparing the latency (Figure 7) and bandwidth (Figure 9) constraints, SPECfp and PhysicsBench have more performance degradation under the bandwidth constraint than under the pure-latency constraint, but SPECint performs better, suggesting that SPECint is less sensitive to bandwidth.
The above plots suggest that a heterogeneous system can still achieve much of the potential speedup even without a potentially expensive low-latency, high-bandwidth communication channel.
To further evaluate whether GPU-like systems could be usefully attached using an even lower-bandwidth interconnect, we measure the sensitivity of performance to bandwidth at an instruction latency of 100. Figure 10 shows the speedup for varying bandwidth. Bandwidth (x-axis) is normalized to 1 cycle per datum, equivalent to about 16 GB/s in today's systems. Speedup (y-axis) is relative to the workload running on a sequential processor.
SPECfp and PhysicsBench have similar sensitivity to reduced bandwidth, while SPECint's speedup loss at low bandwidth is less significant (Figure 10). Although there is some loss of performance at PCI Express speeds (normalized bandwidth = 1/8), about half of the potential benefit of heterogeneity remains at PCI-like speeds (normalized bandwidth = 1/128). At PCI Express x16 speeds, SPECint can achieve 92%, SPECfp 69%, and PhysicsBench 78% of the speedup achievable without latency and bandwidth limitations.
As can be seen from the above data, heterogeneous systems can potentially provide significant performance improvements on a wide range of applications, even when system cost sensitivity demands a high-latency, low-bandwidth interconnect. However, the data also show that applications are not entirely insensitive to latency and bandwidth, so high-performance systems will still need to worry about increasing bandwidth and lowering latency.
The lower sensitivity to latency than to bandwidth suggests that a shared-memory multicore heterogeneous system would be of benefit, as sharing a single memory system avoids data copying when migrating tasks between processors, leaving only synchronization latency. This could increase costs, as die size would increase and the memory system would then need to support the needs of both sequential and parallel processors. A high-performance off-chip interconnect like PCI Express or HyperTransport may be a good compromise.
5 Related Work
There have been many limit studies on the amount of parallelism within sequential programs.
Wall [7] studies parallelism in SPEC92 under various limitations in branch prediction, register renaming, and memory disambiguation. Lam et al. [17] study parallelism under branch prediction, control dependence analysis, and multiple-fetch. Postiff et al. [15] perform a similar analysis on the SPEC95 benchmark suite. These studies showed that significant amounts of parallelism exist in typical applications under optimistic assumptions. They focused on extracting instruction-level parallelism on a single processor. As it becomes increasingly difficult to extract ILP from a single processor, performance increases often come from multicore systems.
As we move towards multicore systems, new constraints, such as communication latency, become relevant. Vachharajani et al. [21] study the speedup available on homogeneous multiprocessor systems. They use a greedy scheduling algorithm to assign instructions to cores. They also scale the communication latency between cores in the array and find that it is a significant limit on available parallelism.
In our study, we extend these analyses to heterogeneous systems, where there are two types of processors. Vachharajani et al. examined the impact of communication between processors within a homogeneous processor array. We examine the impact of communication between a sequential processor and an array of cores. In our model, we roughly account for communication latency between cores within the array by using a higher instruction read-after-write latency.
Heterogeneous systems are interesting because they are commercially available [10, 25, 2] and, for GPU compute systems, can leverage the existing software ecosystem by using the traditional CPU as the sequential processor. They have also been shown to be more area- and power-efficient [16, 26, 27] than homogeneous multicore systems.
Hill and Marty [16] use Amdahl's Law to show that there are limits to parallel speedup, and make the case that when one must trade per-core performance for more cores, heterogeneous multiprocessor systems perform better than homogeneous ones, because non-parallelizable fragments of code do not benefit from more cores, but do suffer when all cores are made slower to accommodate more cores. They indicate that more research should be done to explore "the scheduling and overhead challenges that Amdahl's model doesn't capture". Our work can be viewed as an attempt to further quantify the impact of these challenges.
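Their model can be sketched as follows (a hedged reconstruction using Hill and Marty's common assumption that sequential performance scales as the square root of core resources; f is the parallelizable fraction, n the chip's core budget in base-core equivalents, and r the resources devoted to the large core):

```python
from math import sqrt

def perf(r):
    # Hill-Marty assumption: sequential performance of an r-resource core
    return sqrt(r)

def symmetric_speedup(f, n, r):
    # n/r identical cores, each built from r base-core equivalents
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def asymmetric_speedup(f, n, r):
    # one large r-resource core plus n - r base cores; serial code runs
    # on the large core, parallel code uses all cores at once
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))
```

For example, with n = 256, r = 16, and f = 0.9, the asymmetric design outperforms the symmetric one: the serial fraction runs on the fast core while parallel code still uses nearly the whole resource budget.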
6 Conclusion
We conducted a limit study to analyze the behavior of a set of general-purpose applications on a heterogeneous system consisting of a sequential processor and a parallel processor with higher instruction latency.
We showed that the instruction read-after-write latency of the parallel processor was a significant factor in performance. To be useful for applications without copious amounts of parallelism, we believe that the instruction read-after-write latencies of GPUs will need to decrease, and thus GPUs can no longer rely exclusively on fine-grained multithreading to keep utilization high. We note that VLIW or superscalar issue combined with fine-grained multithreading [3, 19] does not inherently mitigate this read-after-write latency, though adding forwarding [12] might. Our data show that the latency and bandwidth of communication between the parallel cores and the sequential core, while significant factors, have comparatively minor effects on performance. The latency and bandwidth characteristics of PCI Express were sufficient to achieve most of the available performance.
Note that since our results are normalized to the sequential processor, they scale as processor designs improve. As sequential processor performance improves in the future, the read-after-write latency of the parallel processor will also need to improve to match.
Manufacturers have built and will likely continue to build single-chip heterogeneous multicore processors. The data presented in this paper suggest that the reasons for doing so may be other than obtaining higher performance from reduced communication overheads on general-purpose workloads. A subject for future work is evaluating whether such conclusions hold under more realistic evaluation scenarios (limited hardware parallelism, detailed simulations, real hardware), along with exploration of a wider set of applications (ideally including real workloads carefully tuned specifically for a tightly coupled single-chip heterogeneous system). This work also does not quantify the effect that increasing problem size [11] may have on the benefits of heterogeneous (or asymmetric) multicore performance.
Acknowledgements
We thank Dean Tullsen, Bob Dreyer, Hong Wang, Wilson Fung, and the anonymous reviewers for their comments on this work. This work was partly supported by the Natural Sciences and Engineering Research Council of Canada.
References
[1] AMD Inc. The Future is Fusion. http://sites.amd.com/us/Documents/AMD fusion Whitepaper.pdf, 2008.
[2] AMD Inc. ATI Stream Computing User Guide, 2009.
[3] AMD Inc. R700-Family Instruction Set Architecture, 2009.
[4] B. Holden. Latency Comparison Between HyperTransport and PCI-Express in Communications Systems. http://www.hypertransport.org/.
[5] J. A. Butts and G. Sohi. Dynamic Dead-Instruction Detection and Elimination. In Proc. Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 199-210, 2002.
[6] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. McGraw Hill, 2nd edition, 2001.
[7] D. W. Wall. Limits of Instruction-Level Parallelism. Technical Report 93/6, DEC WRL, 1993.
[8] W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware. To appear in ACM Trans. Archit. Code Optim. (TACO), 2009.
[9] G. M. Amdahl. Validity of the single-processor approach to achieving large scale computing capabilities. In AFIPS Conf. Proc., vol. 30, pages 483-485, 1967.
[10] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell multiprocessor. IBM J. R&D, 49(4/5):589-604, 2005.
[11] J. L. Gustafson. Reevaluating Amdahl's Law. Communications of the ACM, 31(5):532-533, 1988.
[12] J. Laudon, A. Gupta, and M. Horowitz. Interleaving: A Multithreading Technique Targeting Multiprocessors and Workstations. In ASPLOS, pages 308-318, 1994.
[13] E. Lindholm, M. J. Kilgard, and H. P. Moreton. A user-programmable vertex engine. In SIGGRAPH, pages 149-158, 2001.
[14] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39-55, March-April 2008.
[15] M. A. Postiff, D. A. Greene, G. S. Tyson, and T. N. Mudge. The Limits of Instruction Level Parallelism in SPEC95 Applications. ACM SIGARCH Comp. Arch. News, 27(1):31-34, 1999.
[16] M. D. Hill and M. R. Marty. Amdahl's Law in the Multicore Era. IEEE Computer, 41(7):33-38, 2008.
[17] M. S. Lam and R. P. Wilson. Limits of control flow on parallelism. In Proc. 19th Int'l Symp. on Computer Architecture, pages 46-57, 1992.
[18] M. T. Yourst. PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator. In IEEE Int'l Symp. on Performance Analysis of Systems and Software (ISPASS), 2007.
[19] J. Montrym and H. Moreton. The GeForce 6800. IEEE Micro, 25(2):41-51, 2005.
[20] S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
[21] N. Vachharajani, M. Iyer, C. Ashok, M. Vachharajani, D. I. August, and D. Connors. Chip multi-processor scalability for single-threaded applications. ACM SIGARCH Comp. Arch. News, 33(4):44-53, 2005.
[22] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. ACM Queue, 6(2):40-53, 2008.
[25] NVIDIA Corp. CUDA Programming Guide, 2.2 edition, 2009.
[26] R. Kumar, D. M. Tullsen, P. Ranganathan, and N. P. Jouppi. Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance. In Proc. 31st Int'l Symp. on Computer Architecture, page 64, 2004.
[27] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction. In Proc. 36th IEEE/ACM Int'l Symp. on Microarchitecture, page 81, 2003.
[28] S. Borkar. Thousand Core Chips: A Technology Perspective. In Proc. 44th Annual Conf. on Design Automation, pages 746-749, 2007.
[29] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An infrastructure for computer system modeling. IEEE Computer, 35(2), 2002.
[30] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically Characterizing Large Scale Program Behavior. In Proc. 10th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2002.
[31] T. Y. Yeh, P. Faloutsos, S. J. Patel, and G. Reinman. ParallAX: an architecture for real-time physics. In Proc. 34th Int'l Symp. on Computer Architecture, pages 232-243, 2007.
[32] J. E. Thornton. Parallel operation in the Control Data 6600. In AFIPS Proc. FJCC, volume 26, pages 33-40, 1964.
[33] H. Wong, A. Bracy, E. Schuchman, T. M. Aamodt, J. D. Collins, P. H. Wang, G. Chinya, A. K. Groen, H. Jiang, and H. Wang. Pangaea: A Tightly-Coupled IA32 Heterogeneous Chip Multiprocessor. In Int'l Conf. on Parallel Architectures and Compilation Techniques (PACT 2008), pages 52-61, Oct. 2008.
Our Many-core Benchmarks Do Not Use That Many Cores
Paul D. Bryan, Jesse G. Beu, Thomas M. Conte, Paolo Faraboschi, Daniel Ortega
Georgia Institute of Technology and HP Labs, Exascale Computing Lab

Considerations for Mondriaan-like Systems
Emmett Witchel
University of Texas at Austin
Mondriaan memory protection is a hardware/software system that provides efficient fine-grained memory protection. Other researchers have embraced the Mondriaan design as an efficient way to associate metadata with each 32-bit word of memory. However, the Mondriaan design is efficient only when the metadata has certain properties. This paper tries to clarify when a Mondriaan-like design is appropriate for a particular problem. It explains how to reason about the space overhead of a Mondriaan design and identifies the significant time overheads as refills to the on-chip metadata cache and the time for software to encode and write metadata table entries.
1 Introduction
Mondriaan memory protection (MMP) is a hardware/software co-design for fine-grained memory protection. Like page tables, the heart of MMP is a set of hardware structures and software-written data structures that efficiently associate protection metadata with user data. Other researchers have used Mondriaan-like structures when they need to efficiently associate non-protection metadata with user data [ZKDK08, CMvPT07]. However, MMP is only efficient under certain assumptions about the metadata and data. This paper tries to clarify these assumptions to guide researchers in deciding when MMP can be useful for their problems.
While the computer science publication system is effective at creating incentives for researchers to publish innovative results, it is less effective at encouraging researchers to reflect on, and publicly critique, their own work. Students of the field are often left wondering why a promising-sounding idea was left unimplemented, or why certain ideas from an early paper on a subject are left out of follow-on work. Did those ideas fail, or were they simply not explored?
While journals provide some outlet for summarizing the progress of a research project, they often default to extended versions of conference papers. This paper is much shorter than a journal paper and tries to convey insights and experience, rather than rigorous quantitative evidence for its conclusions.
While providing a compact summary of MMP research, this paper highlights the assumptions made by various MMP implementations that are required for high performance. These assumptions do not always apply to systems developed by other researchers. The purpose of this paper is to allow other researchers to quickly determine whether their application is likely to perform well with MMP-like hardware, or what they would have to modify to make it perform well.
This paper discusses the interplay between the following design decisions.
1. Space overhead. The space overhead for MMP is approximately the average number of metadata bits per data item. MMP keeps the space overhead low by storing 2 bits of protection information per 32-bit word (approximately a 6% overhead). Tables can be encoded for greater space efficiency if there are long stretches of memory with identical metadata values.
2. PLB reach. MMP includes an on-chip associative memory for its metadata called the protection lookaside buffer (PLB). For the PLB hardware to be an effective cache, the metadata must have particular properties: either much of it is coarse-grained, or it has long segments with identical metadata values.
3. Software overheads. MMP requires system software to write the metadata tables. The metadata format must be simple enough, and written infrequently enough, to prevent software from significantly reducing performance.
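The 6% figure in item 1 is simple arithmetic, which a quick check makes explicit (values taken from the list above):

```python
def space_overhead(meta_bits_per_item, data_bits_per_item):
    """Approximate MMP space overhead: metadata bits per data-item bit."""
    return meta_bits_per_item / data_bits_per_item

# 2 bits of protection information per 32-bit word: 2/32 = 0.0625, i.e. ~6%
overhead = space_overhead(2, 32)
```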
2 MMP history
MMP started in 2002 as follow-on work to low-power data caches [WLAA01]. Our idea was to automatically migrate unused program data to a portion of the cache/memory hierarchy that requires lower power to maintain state. To track program objects, which tend to be small and not naturally aligned, we needed a data structure. The data structure would be written by software and read by hardware, because we thought the hardware would make frequent decisions about what data belongs in high-power fast memory and what can reside in low-power slow memory.
During the design of the hardware data structure, we realized that solving the basic problem of having hardware track user-defined data structures was more profound than the application of moving data to save energy. We soon left that motivation and chose fine-grained protection. The plugin model for program functionality extension made the motivation for fine-grained protection clear. Programs (like the OS and a web browser) load user-supplied code directly into their address space to extend functionality. The problem with this approach is that a bug can crash the entire program: plugins are fast, but not safe. Fine-grained protection can restore the safety without reducing the speed of the plugin extensibility model.
The first MMP paper focused on the format of the hardware tables [WCA02]. This paper is most often cited by those interested in MMP. It introduces the basic idea of MMP and presents both a simple table format and a more advanced table format, a technical innovation explained in §3.2. It also contains a design for fine-grained memory remapping, which allows a user to stitch together bits of memory into a contiguous buffer. The design for remapping is a bit complicated, but provides good support for zero-copy networking. The issues of supporting protection dominated the project after this paper, and the remapping was dropped, simply for lack of space.
The follow-on paper [WA03] describes how the OS support for fine-grained protection domains would work and how to support safe calling between protection domains. Though our experience was limited at the time, much of our design ended up in our final implementation. My thesis [Wit04] continued to refine the OS support and added ideas for protecting the stack. MMP culminated in an SOSP paper [WRA05], which is the most complete implementation of the system, though it is not often cited. Most of the interest in MMP comes from computer architects, many of whom do not regularly read the proceedings of SOSP.
3 MMP technical summary
This section provides a high-level summary of how MMP works, with a focus on how MMP-like hardware would be used for other applications. The three main features of MMP are memory protection, protected cross-domain calling, and stack protection. The feature most attractive for other uses is a generalization of memory protection, which associates metadata with every word of user data. This section focuses on the general design of that protection mechanism.
3.1 CPU modifications
MMP consists of hardware and software to provide fine-grained memory protection. MMP modifies the processor pipeline to check permissions on every load, store, and instruction fetch. MMP is designed to be simple enough to allow an efficient implementation in modern processors, yet powerful enough to allow a variety of software services to be built on top of it. The permissions information managed by MMP could be generalized to any metadata.
Figure 1 shows the basics of the MMP hardware. MMP adds a protection lookaside buffer (PLB), an associative memory like a TLB. The PLB caches entries of a memory-resident permissions (or metadata) table, just as a TLB caches entries of a memory-resident page table. The PLB is indexed by virtual address. MMP also adds two registers: the protection domain ID and a pointer to the base of the permissions table. The protection domain ID identifies the protection (or metadata) context of a particular kernel thread to the PLB, just as an address space identifier identifies a kernel thread to a TLB.
The protection domain ID register is not necessary, but without it, the entire PLB must be flushed on
Figure 1: The major components of the Mondriaan memory protection system: the CPU's protection domain ID (PD-ID) and permissions table base registers, the protection lookaside buffer (PLB), and the in-memory permissions table that refills it.
every domain switch. For some kinds of metadata, this might be acceptable. For example, if each kernel thread is its own domain, then domain switches only happen on context switches, which are relatively rare. In this case, the protection domain ID can be dispensed with, just as the x86 does not tag its TLB. However, many consider the lack of tags in the x86 TLB a major design flaw.
On a memory reference, the processor checks the PLB for permissions (or performs whatever metadata check is specified by a system using an MMP-like structure). If the PLB does not have the permissions information, either hardware or software looks it up in the permissions table residing in memory. The reload mechanism caches the matching entry from the permissions table in the PLB, and possibly writes it to the address register sidecar registers. Sidecar registers were a feature of the early MMP design. They are a cache for the PLB, meant to reduce the energy cost of indexing the PLB. They are simply an energy optimization, and because they are inessential, we do not mention them further.
Just like the TLB, the PLB is indexed by virtual address. The PLB lookup can happen in parallel with address translation because MMP stores its metadata per virtual address. Virtual addresses that alias to the same physical address can have different permissions values in MMP.
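The lookup path can be sketched as a software analogy (a toy only: a real PLB caches variable-granularity table entries in hardware; the 64-byte block granularity and crude eviction below are arbitrary choices for illustration):

```python
class ToyPLB:
    """A small cache of permissions values, tagged by (protection-domain
    ID, virtual block address) so it need not be flushed on a domain switch."""
    def __init__(self, table_walk, capacity=64):
        self.table_walk = table_walk   # refill path: walks the in-memory table
        self.capacity = capacity
        self.entries = {}              # (pd_id, block) -> permissions

    def check(self, pd_id, vaddr):
        key = (pd_id, vaddr >> 6)      # cache at 64-byte (16-word) granularity
        if key not in self.entries:    # PLB miss: refill from the table
            if len(self.entries) >= self.capacity:
                self.entries.pop(next(iter(self.entries)))  # crude eviction
            self.entries[key] = self.table_walk(pd_id, vaddr)
        return self.entries[key]
```

Tagging entries with the domain ID mirrors the PD-ID register discussed above, and keying on virtual addresses mirrors the fact that the PLB lookup can overlap address translation.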
One of the hardware efficiencies of MMP is that the permissions check can start early in the pipeline and can overlap most of the address translation stages and computational steps of the pipeline. The permissions check need finish only before instruction retirement.
3.2 Permissions table
The MMP protection table represents each user segment using one or more table segments. A user segment is a contiguous run of memory words with a single permissions value that has some meaning to the user. For example, a memory block returned from kmalloc would be a user segment. User segments start at any word boundary and do not have to be aligned. A table segment is a unit of permissions representation convenient for the permissions table. MMP is not efficient for arbitrary user segments; it assumes certain properties of user segments to achieve efficient execution (§3.2).
System software converts user segments into table entries when permissions are set on a memory region. As explained in §4.3, the frequency and complexity of transforming user segments into table segments determines whether software is an appropriate choice for encoding table segments. Some table entry formats are inefficient for software to write at the update rates required by applications.
Figure 2: How an address indexes the trie. A 32-bit effective address splits into Root Index (10 bits, 31-22), Mid Index (10 bits, 21-12), Leaf Index (6 bits, 11-6), and Leaf Offset (6 bits, 5-0).
MMP uses a trie to store metadata, just like a page table. The top bits of an address index into a table, whose entry can be a pointer to another table, which is indexed by the next most significant bits of the address.
Figure 2 shows which bits of a 32-bit virtual address are used to index each level of the MMP permissions table trie. Three loads are sufficient to find the metadata for any 32-bit user word. The lookup algorithm (which can be implemented in software or hardware, just like a TLB) is shown in pseudo-code in Figure 3. The root table has 1024 entries, each of which maps a 4 MB block. Entries in the mid-level table map 4 KB blocks. The leaf-level tables have 64 entries, each providing individual permissions for at least 16 four-byte words. The table indices are expanded for 64-bit address spaces [Wit04].
Figure 3: Pseudo-code for the trie table lookup algorithm. The table is indexed with an address and returns a permissions table entry. The base of the root table is held in a dedicated CPU register. The implementation of is_tbl_ptr depends on the encoding of the permission entries.
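The pseudo-code of Figure 3 did not survive conversion to text. The following is a minimal sketch of the three-load walk described above, using the field widths from Figure 2; the entry encoding and the is_tbl_ptr test are simplifications for illustration, not the paper's actual format:

```python
def is_tbl_ptr(entry):
    # In this sketch, an interior entry "points" to the next-level table by
    # simply being a Python list; real entries encode a pointer-vs-permissions
    # distinction in the entry bits.
    return isinstance(entry, list)

def mmp_lookup(root_table, addr):
    """At most three loads: root (1024 entries, 4 MB each), mid (1024
    entries, 4 KB each), leaf (64 entries, 16 words each)."""
    entry = root_table[(addr >> 22) & 0x3FF]    # root index: bits 31-22
    if not is_tbl_ptr(entry):
        return entry                            # permissions for a 4 MB block
    entry = entry[(addr >> 12) & 0x3FF]         # mid index: bits 21-12
    if not is_tbl_ptr(entry):
        return entry                            # permissions for a 4 KB page
    return entry[(addr >> 6) & 0x3F]            # leaf index: bits 11-6
```

The caller then extracts the per-word field from the returned leaf entry using the leaf offset (bits 5-0).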
The granularity of the metadata is determined by the level at which it appears. Each two-bit entry in a leaf table encodes permissions for one user word of memory. One level higher, each mid-level entry represents permissions for an entire 4 KB page. Regions of at least 4 KB that share a single permissions value can therefore be represented with less space overhead. This space saving happens regardless of the entry format.
MMP designs have used different permissions entry formats, notably bitmaps and run-length encoded (RLE) entries (shown in Figure 4). Each leaf entry in bitmap format has 16 two-bit values indicating the permissions for each of 16 words. Run-length encoded entries encode permissions as 4 regions with distinct permissions values, dedicating 8 out of 32 bits to metadata.
RLE entries cannot represent arbitrary word-level metadata. They assume that contiguous words have the same metadata value. For MMP's RLE entries, there can be no more than 4 distinct metadata regions in the entry's 16 data words. If each word has a metadata value distinct from its immediate neighbors, then there are 16 metadata regions, and that cannot be represented with an RLE entry. Bitmap entries are used as a backup in this case.
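To make the bitmap format concrete: reading the 2-bit permission for one of an entry's 16 words is a shift and mask. The bit ordering below (word 0 in the low-order bits) is an assumption for illustration, not taken from the MMP papers.

```java
// Sketch: a bitmap leaf entry packs 16 two-bit permission values.
// Word 0 in the low-order bits is an assumed, illustrative layout.
class BitmapEntry {
    static int permFor(int entry, int wordIdx) { // wordIdx in 0..15
        return (entry >>> (wordIdx * 2)) & 0x3;  // one 2-bit permission value
    }
}
```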
The permissions data in MMP run-length encoded entries overlaps with previous and succeeding entries. In addition to permissions information for the 16 words, they can also contain permissions for up to 31 words previous and 32 words subsequent to the 16. In the best case a single RLE entry can contain permissions from 5 distinct bitmap entries.
MMP uses RLE entries to overlap permissions information, but they can also be used to save space in leaf-level tables. A 32-bit RLE entry can represent permissions information about 79 words. Taking alignment into account, each entry could encode permissions for 64 words instead of 16, bringing the average space overhead for leaf-level tables down from 6% to 1.6%. Doing so would change the lookup algorithm in Figure 3, because the leaf index would require only 4 bits, leaving 8 bits for the leaf offset. This new RLE format would be more restrictive, allowing only 4 permissions regions in every aligned 64-word block.
4 Requirements for good MMP performance
This section distills our observations on the factors salient for a particular instantiation of an MMP-like system to have good performance.
4.1 Space overhead
While physical memory capacity continues to grow at an impressive rate, MMP-like systems consume memory in proportion to the virtual memory used by a process. As processes use more memory, MMP uses more memory to hold the metadata associated with the data. Keeping the size of the metadata tables reasonable is a first-order concern for the practicality of the system.
For the simplest Mondriaan system [WCA02], the space overhead of the most fine-grained tables is approximately 6%, for 2 bits of metadata per 32-bit data word. Both bitmaps and run-length encoded entries dedicate two bits of table entry per user word in leaf-level tables that manage permissions for 32-bit words. The run-length encoding could be adjusted for lower space overhead (§3.2). The mid-level entries that manage permissions for 4 KB pages specify 2 bits of metadata per aligned 512 bytes of data, for a space overhead of 0.8%.
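The leaf-level figures can be checked by simple arithmetic: the quoted 6% is 2/32 = 6.25% rounded, and the 1.6% RLE variant of §3.2 is one 32-bit entry per aligned 64-word block.

```java
// Arithmetic behind the quoted leaf-table overheads.
class LeafOverheads {
    // 2 metadata bits per 32-bit user word.
    static double bitmapOverhead() { return 2.0 / 32.0; }         // 0.0625, ~6%
    // One 32-bit RLE entry per aligned 64-word block (§3.2 variant).
    static double rleOverhead()    { return 32.0 / (64 * 32.0); } // 0.015625, ~1.6%
}
```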
Figure 4: The bit allocation for a run-length encoded (RLE) permission table entry.
memory protection to the Linux kernel), modifies the kernel memory allocator to create larger, aligned data regions. The slab [Bon94] allocator takes a memory page and breaks it into equal-sized chunks that are doled out by kmalloc, the general-purpose kernel memory allocator. By turning read/write permissions on and off for the entire page, rather than for each individual call to kmalloc, Mondrix greatly reduced the space overheads of the permissions. The cost is less memory protection. A read or write into the unallocated area of a page being used as a slab is an error that can be detected if the page's permissions are managed on a per-word basis. However, Mondrix forgoes this protection, enabling read and write permissions to the entire page once any of it is used.
Colorama [CMvPT07] uses a Mondriaan design and extends the permission table entries with 12-bit color identifiers that allow the processor to infer synchronization for user data. The color identifiers bring the overhead from 2 bits per 32-bit word to 14 bits per 32-bit word, which is a 44% space overhead. Run-length encoding can bring down this overhead. Using MMP's RLE entry expands the 8 permissions bits to 48 for a 14% space overhead (though keeping entries aligned, which is necessary for a realistic design, would increase the space overhead to nearly 19%). Furthermore, not every data item needs to be colored, only those accessed by multiple threads. The Colorama implementation has a measured space overhead of 0–28%.
Loki [ZKDK08] uses tagged memory to reduce the amount of trusted code in the HiStar operating system. Loki differs from MMP in that the tags are for physical memory. Additionally, Loki maintains two distinct maps, one from physical memory address to tag and another from tag to access permissions.
Loki segregates pages on the basis of whether they need fine-grained tags. Pages with fine-grained tags have 100% space overhead (a 32-bit tag for a 32-bit word), while pages without fine-grained tags (one tag for the entire page) have 0.1% space overhead. The authors see a variable fraction of memory pages that use fine-grained tags, from 3–65%. For this scenario, the fraction of memory pages using fine-grained tags dictates the memory overhead, so the application that uses fine-grained tags for 65% of its pages experiences a space overhead of 65%. Loki does not use an MMP design for its tags, but the designers note that MMP's RLE entries could save space.
4.2 PLB reach
MMP uses a protection lookaside buffer (PLB) to cache permissions information for data accessed by the CPU, avoiding long walks through the memory-resident permissions table. A high hit rate for the PLB is essential for low-latency performance. Without a high hit rate, the processor is constantly fetching data from the permissions tables, which increases memory pressure and cache area pressure and decreases the rate at which instructions can retire.
MMP contains several features to enhance the hit rate in the PLB that can be adopted as-is by other projects. The PLB allows different entries to apply to different power-of-two sized ranges. This mechanism allows large-granularity entries to co-exist with word-granularity entries (much like super-pages in TLBs). The PLB tags also include protection domain IDs to avoid flushing the PLB on domain switches. Tags are important for Mondrix because its fine-grained protection domains can be crossed as frequently as every 664 cycles [WRA05]. Other applications of MMP might not have such frequent domain crossings.
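A range-tagged PLB hit check reduces to a masked compare, as with superpage TLBs. The sketch below illustrates the idea only; it is not MMP's actual PLB logic.

```java
// Sketch: a PLB entry covering an aligned power-of-two range hits iff the
// address equals the entry base under the range mask (illustrative only).
class PlbMatch {
    static boolean hits(int entryBase, int log2Bytes, int addr) {
        int mask = ~((1 << log2Bytes) - 1); // clear the low log2Bytes bits
        return (addr & mask) == (entryBase & mask);
    }
}
```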
The main technique for MMP to increase PLB reach is to use large-granularity entries (which is done by Mondrix) or run-length encoded entries (e.g., vpr and twolf from SPEC2000 [WCA02]). The PLB miss rate for Mondrix was lower than 1% for all workloads, and the execution penalty for PLB refill was less than 4% of execution time, because kernel text and data sections are represented with a single entry, and as mentioned in the previous section, the kernel memory allocator was modified to manage protections at the granularity of a page.
When each user allocation for vpr and twolf is protected by inaccessible words at the start and end of the allocation, the system spends 10–20% of its memory references refilling the PLB [WCA02] when using bitmap entries. Run-length encoded entries increase PLB reach by effectively encoding large user segments that share a permissions value with the start of the entry and/or its end. Using run-length encoded entries, memory accesses to the permissions table drop to 7.5% for both SPEC2000 benchmarks. As §3.2 discusses, RLE entries encode overlapping permissions information. A single RLE entry can contain the permissions information from multiple bitmapped entries, eliminating PLB refills.
Because Colorama only monitors shared-data accesses, it decreases traffic to the on-chip metadata cache (the Colorama PLB). The Colorama implementation uses run-length encoded entries (called mini-SST entries in the original design [WCA02]), which should be effective at making PLB reach large enough for high performance. Additionally, the authors mention that color metadata could be aggregated into larger-granularity chunks by doing pooled memory allocation.
Loki's support for page-granularity metadata tags is crucial to keeping its runtime overheads low. For one fork/exec benchmark, page-granularity tags reduce the time overhead from 55% to 1%. Loki has an 8-entry cache that maps a physical page to its tag, or to a pointer to a page of fine-grained tags. The fine-grained tags are stored in the CPU's cache. The physical address to tag map does not need to be flushed on a context switch. Loki also has a 32-entry 2-way associative cache that maps tags to permissions. This cache does need to be flushed on context switches, but does not need to be flushed when memory changes tags.
4.3 Software overheads
In an MMP-like design, metadata is managed the way page tables are managed: software writes table entries that are read by hardware. Mondrix writes protection tables frequently to protect memory allocations, to protect network packets, and to protect arguments to cross-domain calls. The time for software to write the tables can become a significant performance cost, up to 10% of the kernel execution time in one Mondrix workload.

It is possible that a given application for an MMP-like system will have infrequent metadata updates. Having software encode table entries is a good choice for systems that update metadata infrequently because software is so flexible. However, we found that the only reliable technique for evaluating the cost of the software encoding is to implement it and run it on realistic inputs. The software entry encoding does not need to play a functional role in the system, but it is necessary for benchmarking.
The MMP ASPLOS paper [WCA02] does not evaluate the cost of writing table entries in software, as it is a typical hardware evaluation paper that lacks system software support. In its defense, the system software required years of development effort, though the effort to develop the table-entry encoder was a small fraction of that time. One unexpected consequence of writing the software to encode table entries was measuring the high runtime cost of writing run-length encoded entries. On one trace of memory protection calls extracted from Mondrix execution, writing run-length encoded entries is three times slower than writing bitmap entries. The run-length encoded entries are slow for software to write because they are complicated to encode, and because they overlap, updates to an entry require complicated logic for breaking and coalescing adjacent entries. While we developed and debugged the code to write run-length encoded entries (a task that required a solid month), we never deployed it in Mondrix because of its poor performance. Also, Mondrix had enough coarse-grained allocations that it did not need run-length encoded entries. Because of our experience with the software, we believe that any MMP implementation with run-length encoded entries will require hardware to encode the table entries.
Run-length encoded entries might be effective for Colorama, because the metadata update rate should be lower than Mondrix's. Mondrix writes the permissions table on memory allocations, and also during data structure processing (e.g., packet reception) and for cross-domain calls. The Colorama implementation measures low allocation rates for some applications (every 129K–288M instructions), and high rates for others, every 2–4K instructions. The authors conservatively assume that every allocation is for colored data, while the true rate for changing the color table might be lower. The encoding costs for the run-length encoded entries might be an issue if the color tables are actually updated every 2–4K instructions.
Loki's simple data layout can be efficiently written by software. Page-granularity tags are held in an array, and fine-grained tags occupy the same offset in a page as their associated data.
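That layout can be sketched as two index computations. The 4 KB page size and the array shapes here are assumptions for illustration, not Loki's published structures.

```java
// Sketch of the described layout: one tag per 4 KB physical page in a flat
// array; for fine-grained pages, a word's tag sits at the same offset in a
// shadow tag page as the word itself (shapes assumed for illustration).
class LokiLayout {
    static int pageIndex(int phys)   { return phys >>> 12; }          // which page tag
    static int fineTagSlot(int phys) { return (phys & 0xFFF) >>> 2; } // per-word slot
}
```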
Summary. The trade-offs among space overhead, PLB reach, and software overheads are complex for a high-performance MMP-like system. Applying MMP to SPEC2000 and to the Linux kernel resulted in different trade-offs. For projects that only tangentially involve an MMP-like structure, the details of these trade-offs are out of scope. However, a high-level argument for the plausibility of a specific application is necessary to make an argument for the efficiencies of an MMP implementation.
5 Conclusion
The hardware and software designs for Mondriaan memory protection can be used to associate arbitrary metadata with individual user words at reasonable storage and execution time costs. However, keeping those costs limited requires careful design. The original MMP design makes assumptions that follow-on work may violate.
We encourage others to use MMP-like structures, and to include a discussion about space overhead, PLB reach, and software overheads. We hope this paper can act as a guide. The original MMP design limits space overhead to 6% by using 2 metadata bits for each data word. It increases PLB reach either by using run-length encoded entries or by relying on large user segments. MMP limits software overheads by writing bitmaps in software and run-length encoded entries in hardware.
Acknowledgments
We thank Nickolai Zeldovich and Luis Ceze for their considered comments on a draft of this paper. Thanks also to Owen Hofmann and the anonymous referees for their constructive comments.
References
[Bon94] Jeff Bonwick. The slab allocator: An object-caching kernel memory allocator. In USENIX Summer, 1994.

[CMvPT07] Luis Ceze, Pablo Montesinos, Christoph von Praun, and Josep Torrellas. Colorama: Architectural support for data-centric synchronization. In IEEE International Symposium on High Performance Computer Architecture, 2007.

[WA03] Emmett Witchel and Krste Asanovic. Hardware works, software doesn't: Enforcing modularity with Mondriaan memory protection. In HotOS, 2003.

[WCA02] Emmett Witchel, Josh Cates, and Krste Asanovic. Mondrian memory protection. In 10th International Conference on Architectural Support for Programming Languages and Operating Systems, Oct 2002.

[Wit04] Emmett Witchel. Mondriaan Memory Protection. PhD thesis, Massachusetts Institute of Technology, January 2004.

[WLAA01] Emmett Witchel, Sam Larsen, C. Scott Ananian, and Krste Asanovic. Direct addressed caches for reduced power consumption. In Proceedings of the 34th Annual International Symposium on Microarchitecture (MICRO-34), December 2001.

[WRA05] Emmett Witchel, Junghwan Rhee, and Krste Asanovic. Mondrix: Memory isolation for Linux using Mondriaan memory protection. In Proceedings of the Twentieth ACM Symposium on Operating Systems Principles, 2005.

[ZKDK08] Nickolai Zeldovich, Hari Kannan, Michael Dalton, and Christos Kozyrakis. Hardware enforcement of application security policies using tagged memory. In Operating Systems Design and Implementation, 2008.
Is Transactional Programming Actually Easier?
Christopher J. Rossbach, Owen S. Hofmann, and Emmett Witchel
Department of Computer Science, University of Texas at Austin
{rossbach,osh,witchel}@cs.utexas.edu
Abstract
Chip multi-processors (CMPs) have become ubiquitous, while tools that ease concurrent programming have not. The promise of increased performance for all applications through ever more parallel hardware requires good tools for concurrent programming, especially for average programmers. Transactional memory (TM) has enjoyed recent interest as a tool that can help programmers program concurrently.
The TM research community claims that programming with transactional memory is easier than alternatives (like locks), but evidence is scant. In this paper, we describe a user study in which 147 undergraduate students in an operating systems course implemented the same programs using coarse and fine-grain locks, monitors, and transactions. We surveyed the students after the assignment, and examined their code to determine the types and frequency of programming errors for each synchronization technique. Inexperienced programmers found baroque syntax a barrier to entry for transactional programming. On average, subjective evaluation showed that students found transactions harder to use than coarse-grain locks, but slightly easier to use than fine-grained locks. Detailed examination of synchronization errors in the students' code tells a rather different story. Overwhelmingly, the number and types of programming errors the students made were much lower for transactions than for locks. On a similar programming problem, over 70% of students made errors with fine-grained locking, while less than 10% made errors with transactions.
1 Introduction
Transactional memory (TM) has enjoyed a wave of attention from the research community. The increasing ubiquity of chip multiprocessors has resulted in a high availability of parallel hardware resources, without many concurrent programs. TM researchers position TM as an enabling technology for concurrent programming for the “average” programmer.
Transactional memory allows the programmer to delimit regions of code that must execute atomically and in isolation. It promises the performance of fine-grain locking with the code simplicity of coarse-grain locking. In contrast to locks, which use mutual exclusion to serialize access to critical sections, TM is typically implemented using optimistic concurrency techniques, allowing critical sections to proceed in parallel. Because this technique dramatically reduces serialization when dynamic read-write and write-write sharing is rare, it can translate directly to improved performance without additional effort from the programmer. Moreover, because transactions eliminate many of the pitfalls commonly associated with locks (e.g. deadlock, convoys, poor composability), transactional programming is touted as being easier than lock-based programming.
Evaluating the ease of transactional programming relative to locks is largely uncharted territory. Naturally, the question of whether transactions are easier to use than locks is qualitative. Moreover, since transactional memory is still a nascent technology, the only available transactional programs are research benchmarks, and the population of programmers familiar with both transactional memory and locks for synchronization is vanishingly small.
To address the absence of evidence, we developed a concurrent programming project for students of an undergraduate operating systems course at the University of Texas at Austin, in which students were required to implement the same concurrent program using coarse and fine-grained locks, monitors, and transactions. We surveyed students about the relative ease of transactional programming as well as their investment of development effort using each synchronization technique. Additionally, we examined students' solutions in detail to characterize and classify the types and frequency of programming errors students made with each programming technique.
This paper makes the following contributions:
• A project and design for collecting data relevant to the question of the relative ease of programming with different synchronization primitives.
• Data from 147 student surveys that constitute the first (to our knowledge) empirical data relevant to the question of whether transactions are, in fact, easier to use than locks.
Figure 1: A screen shot of sync-gallery, the program undergraduate OS students were asked to implement. In the figure, the colored boxes represent 16 shooting lanes in a gallery populated by shooters, or rogues. A red or blue box represents a lane in which a rogue has shot either a red or blue paint ball. A white box represents a lane in which no shooting has yet taken place. A purple box indicates a lane in which both a red and blue shot have occurred, indicating a race condition in the program. Sliders control the rate at which shooting and cleaning threads perform their work.
• A taxonomy of synchronization errors made with different synchronization techniques, and a characterization of the frequency with which such errors occur in student programs.
2 Sync-gallery
In this section, we describe sync-gallery, the Java programming project we assigned to students in an undergraduate operating systems course. The project is designed to familiarize students with concurrent programming in general, and with techniques and idioms for using a variety of synchronization primitives to manage data structure consistency. Figure 1 shows a screen shot from the sync-gallery program.
The project asks students to consider the metaphor of a shooting gallery, with a fixed number of lanes in which rogues (shooters) can shoot in individual lanes. Being pacifists, we insist that shooters in this gallery use red or blue paint balls rather than bullets. Targets are white, so that lanes will change color when a rogue has shot in one. Paint is messy, necessitating cleaners to clean the gallery when all lanes have been shot. Rogues and cleaners are implemented as threads that must check the state of one or more lanes in the gallery to decide whether it is safe to carry out their work. For rogues, this work amounts to shooting at some number of randomly chosen lanes. Cleaners must return the gallery to its initial state with all lanes white. The students must use various synchronization primitives to enforce a number of program invariants:
1. Only one rogue may shoot in a given lane at a time.
2. Rogues may only shoot in a lane if it is white.
3. Cleaners should only clean when all lanes have been shot (are non-white).
4. Only one thread can be engaged in the process of cleaning at any given time.
If a student writes code for a rogue that fails to respect the first two invariants, the lane can be shot with both red and blue, and will therefore turn purple, giving the student instant visual feedback that a race condition exists in the program. If the code fails to respect the second two invariants, no visual feedback is given (indeed, these invariants can only be checked by inspection of the code in the current implementation).
We ask the students to implement 9 different versions of rogues (Java classes) that are instructive for different approaches to synchronization. Table 1 summarizes the rogue variations. Gaining exclusive access to one or two lanes of the gallery in order to test the lane's state and then modify it corresponds directly to the real-world programming task of locking some number of resources in order to test and modify them safely in the presence of concurrent threads.
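As a point of reference for the lock-based variants that follow, the test-and-modify step under a single global lock can be sketched as below. Lane, Color, and the method names are illustrative, not the project's actual skeleton.

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of a coarse-grain rogue step (names illustrative, not the course
// skeleton): one global lock guards the test-and-modify of a lane.
class CoarseSketch {
    enum Color { WHITE, RED, BLUE }
    static class Lane {
        Color color = Color.WHITE;
    }
    static final ReentrantLock galleryLock = new ReentrantLock();

    // Returns true if the shot landed (the lane was still white).
    static boolean shoot(Lane lane, Color paint) {
        galleryLock.lock();
        try {
            if (lane.color != Color.WHITE) return false; // invariant 2
            lane.color = paint;                          // invariant 1 held by the lock
            return true;
        } finally {
            galleryLock.unlock();
        }
    }
}
```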
2.1 Locking
We ask the students to synchronize rogue and cleaner threads in the sync-gallery using locks to teach them about coarse and fine-grain locking. To ensure that students write code that explicitly performs locking and unlocking operations, we require them to use the Java ReentrantLock class and do not allow use of the synchronized keyword. In locking rogue variations, cleaners do not use dedicated threads; the rogue that colors the last white lane in the gallery is responsible for becoming a cleaner and subsequently cleaning all lanes. There are four variations on this rogue type: Coarse, Fine, Coarse2 and Fine2. In the coarse implementation, students are allowed to use a single global lock which is acquired before attempting to shoot or clean. In the fine-grain implementation, we require the students to implement individual locks for each lane. The Coarse2 and Fine2 variations require the same mapping of locks to objects in the gallery as their counterparts above, but introduce the additional stipulation that rogues must acquire access to and shoot at two random lanes rather than one. The pedagogical value is illustrating that fine-grain locking requires a lock-ordering discipline to avoid deadlock, while a single coarse lock does not. Naturally, the use of fine-grain lane locks complicates the enforcement of invariants 3 and 4 above.

    final int x = 10;
    Callable c = new Callable<Void> {
        public Void call() {
            // txnl code
            y = x * 2;
            return null;
        }
    }
    Thread.doIt(c);

    Transaction tx = new Transaction(id);
    boolean done = false;
    while (!done) {
        try {
            tx.BeginTransaction();
            // txnl code
            done = tx.CommitTransaction();
        } catch (AbortException e) {
            tx.AbortTransaction();
            done = false;
        }
    }

Figure 2: Examples of (left) DSTM2 concrete syntax, and (right) JDASTM concrete syntax.
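One standard lock-ordering discipline, sketched here with illustrative names, is to acquire the two per-lane locks in increasing lane order, so two rogues can never hold the same pair in opposite orders:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: deadlock-free acquisition of two per-lane locks by always locking
// the lower-indexed lane first (illustrative, not the assignment's code).
class TwoLaneSketch {
    static ReentrantLock[] laneLocks;

    static void lockPair(int a, int b) {
        int first = Math.min(a, b), second = Math.max(a, b);
        laneLocks[first].lock();
        if (second != first) laneLocks[second].lock();
    }

    static void unlockPair(int a, int b) {
        int first = Math.min(a, b), second = Math.max(a, b);
        if (second != first) laneLocks[second].unlock();
        laneLocks[first].unlock();
    }
}
```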
2.2 Monitor implementations
Students must use condition variables along with signal/wait to implement both fine and coarse locking versions of the rogue programs. These two variations introduce dedicated threads for cleaners: shooters and cleaners must use condition variables to coordinate shooting and cleaning phases. In the coarse version (CoarseCleaner), students use a single global lock, while the fine-grain version (FineCleaner) requires per-lane locks.
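The coordination follows the standard monitor idiom: a cleaner waits on a condition until no white lanes remain, and the shooter that colors the last white lane signals. A minimal coarse-grained sketch with illustrative names (not the course skeleton):

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the coarse monitor idiom (names illustrative): a cleaner awaits
// until no white lanes remain; the shooter that colors the last one signals.
class CleanerSketch {
    final ReentrantLock lock = new ReentrantLock();
    final Condition allShot = lock.newCondition();
    int whiteLanes;

    CleanerSketch(int lanes) { whiteLanes = lanes; }

    void shootOne() {
        lock.lock();
        try {
            if (whiteLanes > 0 && --whiteLanes == 0)
                allShot.signalAll();       // gallery full: wake the cleaner
        } finally { lock.unlock(); }
    }

    void cleanWhenFull(int lanes) {
        lock.lock();
        try {
            while (whiteLanes > 0)         // invariant 3: clean only when all shot
                allShot.awaitUninterruptibly();
            whiteLanes = lanes;            // reset the gallery to all white
        } finally { lock.unlock(); }
    }
}
```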
2.3 Transactions
Finally, the students are asked to implement 3 TM-based variants of the rogues that share semantics with some locking versions, but use transactional memory for synchronization instead of locks. The most basic TM-based rogue, TM, is analogous to the Coarse and Fine versions: rogue and cleaner threads are not distinct, and shooters need shoot only one lane, while the TM2 variation requires that rogues shoot at two lanes rather than one. In the TMCleaner variation, rogues and cleaners have dedicated threads. Students can rely on the TM subsystem to detect conflicts and restart transactions to enforce all invariants, so no condition synchronization is required.
2.4 Transactional Memory Support
Since sync-gallery is a Java program, we were faced with the question of how to support transactional memory. The ideal case would have been to use a software transactional memory (STM) that provides support for atomic blocks, allowing students to write transactional code of the form:
    void shoot() {
        atomic {
            Lane l = getLane(rand());
            if (l.getColor() == WHITE)
                l.shoot(this.color);
        }
    }
Rogue name     Technique                       R/C Threads    Additional Requirements
Coarse         Single global lock              not distinct
Coarse2        Single global lock              not distinct   rogues shoot at 2 random lanes
CoarseCleaner  Single global lock, conditions  distinct       conditions, wait/notify
Fine           Per-lane locks                  not distinct
Fine2          Per-lane locks                  not distinct   rogues shoot at 2 random lanes
FineCleaner    Per-lane locks, conditions      distinct       conditions, wait/notify
TM             TM                              not distinct
TM2            TM                              not distinct   rogues shoot at 2 random lanes
TMCleaner      TM                              distinct

Table 1: The nine different rogue implementations required for the sync-gallery project. The Technique column indicates what synchronization technique was required. The R/C Threads column indicates whether coordination was required between dedicated rogue and cleaner threads or not. A value of "distinct" means that rogue and cleaner instances run in their own thread, while a value of "not distinct" means that the last rogue to shoot an empty (white) lane is responsible for cleaning the gallery.
No such tool is yet available; implementing compiler support for atomic blocks, or use of a source-to-source compiler such as spoon [1], was considered out of scope for the project. The trade-off is that students are forced to deal directly with the concrete syntax of our TM implementation, and must manage read and write barriers explicitly. We assigned the lab to 4 classes over 2 semesters. During the first semester both classes used DSTM2 [14]. For the second semester, both classes used JDASTM [24].
The concrete syntax has a direct impact on ease of programming, as seen in Figure 2. Both examples pepper the actual data structure manipulation with code that explicitly manages transactions. We replaced DSTM2 in the second semester because we felt that the JDASTM syntax was somewhat less baroque and did not require students to deal directly with programming constructs like generics. Also, DSTM2 binds transactional execution to specialized thread classes. However, both DSTM2 and JDASTM require explicit read and write barrier calls for transactional reads and writes.
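With explicit transaction management, the atomic shoot example above becomes a retry loop in the style of Figure 2's JDASTM code. The sketch below stubs out the TM runtime to show only the control-flow pattern; the Transaction methods echo Figure 2, but the stub behavior and names are ours, not JDASTM's actual API.

```java
// Sketch of the explicit retry pattern students wrote (control flow only).
// The Transaction stub always commits; a real TM detects conflicts and
// throws, driving the catch/retry path. Names and behavior are illustrative.
class TmSketch {
    static class AbortException extends RuntimeException {}
    static class Transaction {
        void beginTransaction() {}
        boolean commitTransaction() { return true; } // stub: no conflicts
        void abortTransaction() {}
    }

    static final int WHITE = 0;

    // Shoot a lane transactionally: color it only if it is still white.
    static int shoot(int laneColor, int paint) {
        Transaction tx = new Transaction();
        boolean done = false;
        int result = laneColor;
        while (!done) {
            try {
                tx.beginTransaction();
                // reads/writes here would go through explicit barrier calls
                if (result == WHITE) result = paint;
                done = tx.commitTransaction();
            } catch (AbortException e) {
                tx.abortTransaction();
                result = laneColor; // discard local updates and retry
                done = false;
            }
        }
        return result;
    }
}
```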
3 Methodology
Students completed the sync-gallery program as a programming assignment in several operating systems classes at the University of Texas at Austin. In total, 147 students completed the assignment, spanning two sections each in classes from two different semesters of the course. The semesters were separated by a year. We provided an implementation of the shooting gallery, and asked students to write the rogue classes described in the previous sections, respecting the given invariants.
We asked students to record the amount of time they spent designing, coding, and debugging each programming task (rogue). We use the amount of time spent on each task as a measure of the difficulty that task presented to the students. This data is presented in Section 4.1. After completing the assignment, students rated their familiarity with concurrent programming concepts prior to the assignment. Students then rated their experience with the various tasks, ranking synchronization methods with respect to ease of development, debugging, and reasoning (Section 4.2).
While grading the assignment, we recorded the type and frequency of synchronization errors students made. These are the errors still present in the students' final versions of the code. We use the frequency with which students made errors as another metric of the difficulty of various synchronization constructs.
To prevent experience with the assignment as a whole from influencing the difficulty of each task, we asked students to complete the tasks in different orders. In each group of rogues (single-lane, two-lane, and separate cleaner thread), students completed the coarse-grained lock version first. Students then completed either the fine-grained or the TM version second, depending on their assigned group. We asked students to randomly assign themselves to groups based on hashes of their names. Due to an error, nearly twice as many students were assigned to the group completing the fine-grained version first. However, there were no significant differences in programming time between the two groups, suggesting that the order in which students implemented the tasks did not affect the difficulty of each task.
3.1 Limitations
Perhaps the most important limitation of the study is the much greater availability of documentation and tutorial information about locking than about transactions. The novelty of transactional memory made it more difficult both to teach and to learn. The concrete syntax of transactions is also a barrier to ease of understanding and use (see §4.2). Lectures about locking drew on a larger body of understanding that has existed for a longer time. It is unlikely that students from one year influenced students from the
Figure 3: Average design, coding, and debugging time spent for analogous rogue variations.
Figure 4: Distributions for the amount of time students spent coding and debugging, for all rogue variations.
next year, given the difference in concrete syntax between the two courses.
4 Evaluation
We examined development time, user experiences, and programming errors to determine the difficulty of programming with various synchronization primitives. In general, we found that a single coarse-grained lock had similar complexity to transactions. Both of these primitives were less difficult, caused fewer errors, and had better student responses than fine-grained locking.
4.1 Development time
Figures 4 and 3 characterize the amount of time the students spent designing, coding, and debugging with each synchronization primitive. On average, transactional memory required more development time than coarse locks, but less than required for fine-grain locks and condition synchronization. With more complex synchronization tasks, such as coloring two lanes and condition synchronization, the amount of time required for debugging increases relative to the time required for design and coding (Figure 3).
We evaluate the statistical significance of differences in development time in Table 2. Using a Wilcoxon signed-rank test, we evaluated, for each pair of synchronization tasks, the alternative hypothesis that the row task required less time than the column task. Pairs for which the signed-rank test reports a p-value of < .05 are considered statistically significant, indicating that the row task required less time than the column task. If the p-value is greater than .05, either the difference in time for the tasks is not statistically significant or the row task required more time than the column task. Results for the different class years are separated due to differences in the TM part of the assignment (Section 2.4).
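The statistic behind Table 2 can be sketched in a few lines. The code below is a minimal pure-Python illustration, not the analysis scripts used in the study; the timing data is invented, and tied absolute differences receive sequential rather than averaged ranks for brevity.

```python
# One-sided Wilcoxon signed-rank statistic for paired samples.
# Differences are row - column, so a small w_plus supports the
# hypothesis that the row task took less time than the column task.
def signed_rank(row_times, col_times):
    # Signed differences, discarding zeros as the test prescribes.
    diffs = [r - c for r, c in zip(row_times, col_times) if r != c]
    # Rank by absolute magnitude (ties get sequential ranks here;
    # a full implementation would average tied ranks).
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank + 1 for rank, i in enumerate(order) if diffs[i] > 0)
    w_minus = sum(rank + 1 for rank, i in enumerate(order) if diffs[i] < 0)
    return w_plus, w_minus

# Illustrative per-student minutes (not the study's data): every student
# finished the coarse-lock task faster than the fine-grain one.
coarse = [30, 45, 25, 50, 40, 35]
fine = [55, 70, 40, 90, 65, 50]
w_plus, w_minus = signed_rank(coarse, fine)
# All differences are negative, so w_plus is 0: the most extreme
# value in favor of "coarse took less time than fine".
```

The returned statistic is then compared against the signed-rank null distribution to obtain the p-values reported in Table 2.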
We found that students took more time to develop the initial tasks while familiarizing themselves with the assignment. Except for fine-grain locks, later versions of similar synchronization primitives took less time than earlier ones, e.g., the Coarse2 task took less time than the Coarse task. In addition, condition synchronization is difficult. For both rogues with less complex synchronization (Coarse and TM), adding condition synchronization increases the time required for development. For fine-grain locking, students simply replace one complex problem with a second, and so do not require significant additional time.
In both years, we found that coarse locks and transactions required less time than fine-grain locks on the more complex two-lane assignments. This echoes the promise of transactions: removing the coding and debugging complexity of fine-grain locking and lock ordering when more than one lock is required.
4.2 User experience
To gain insight into the students' perceptions about the relative ease of using different synchronization techniques, we asked the students to respond to a survey after completing the sync-gallery project. The survey ends with six questions asking students to rank their favorite technique with respect to ease of development, debugging, reasoning about, and so on.
A version of the complete survey can be viewed at [2]. In student opinions, we found that the more baroque syntax of the DSTM2 system was a barrier to entry for new transactional programmers. Figure 5 shows student responses to questions about syntax and ease of thinking about different synchronization primitives. In the first class year, students found transactions more difficult to think about, and found their syntax more difficult than that of fine-grain locks. In the second year, when the TM implementation was replaced with one less cumbersome, student opinions aligned with our other findings: TM ranked behind coarse locks, but ahead of fine-grain locks. For both years, other questions on ease of design and implementation mirrored these results, with TM ranked ahead of fine-grain locks.
4.3 Synchronization Error Characterization
We examined the solutions from the second year's class in detail to classify the types of synchronization errors students made, along with their frequency. This involved both a thorough reading of every student's final solutions and automated testing. While the students' subjective evaluation of the ease of transactional programming does not clearly indicate that transactional programming is easier, the types and frequency of programming errors do.
While the students showed an impressive level of creativity with respect to synchronization errors, we found that all errors fit within the taxonomy described below.
1. Lock ordering (lock-ord). In fine-grain locking solutions, a program failed to use a lock-ordering discipline to acquire locks, admitting the possibility of deadlock.
2. Checking conditions outside a critical section (lock-cond). This type of error occurs when code checks a program condition with no locks held, and subsequently acts on that condition after acquiring locks. This was the most common error in sync-gallery, and usually occurred when students would check whether to clean the gallery with no locks held, subsequently acquiring lane locks and proceeding to clean. The result is a violation of invariant 4 (§2). This type of error may be more common because no visual feedback is given when it is violated (unlike races for shooting lanes, which can result in purple lanes).
3. Forgotten synchronization (lock-forgot). This class of errors includes all cases where the programmer forgot to acquire locks, or simply did not realize that a particular region would require mutual exclusion to be correct.
4. Exotic use of condition variables (cv-exotic). We encountered a good deal of signal/wait usage on condition variables that indicates no clear understanding of what the primitives actually do. The canonical example of this is signaling and waiting on the same condition in the same thread.

Figure 5: Selected results from student surveys. Column numbers represent rank order, and entries represent the percentage of students who assigned a particular synchronization technique a given rank (e.g., 80.8% of students ranked coarse locks first in the "Easiest to think about" category). In the first year the assignment was presented, the more complex syntax of DSTM made TM more difficult to think about. In the second year, simpler syntax alleviated this problem.

Table 2: Comparison of time taken to complete programming tasks for all students. The time to complete the task on the row is compared to the time for the task on the column. Each cell contains the p-value for a Wilcoxon signed-rank test of the hypothesis that the row task took less time than the column task. Entries are considered statistically significant when p < .05, meaning that the row task did take less time to complete than the column task, and are marked in bold. Results for the first and second class years are reported separately, due to differing transactional memory implementations.

5. Condition variable use errors (cv-use). These types of errors indicate a failure to use condition variables properly, but do indicate a certain level of understanding. This class includes use of if instead of while when checking conditions on a decision to wait, or failure to check the condition at all before waiting.
6. TM primitive misuse (TM-exotic). This class of error includes any misuse of transactional primitives. Technically, this class includes misuse of the API, but in practice the only errors of this form we saw were failures to call BeginTransaction before calling EndTransaction. Omission of read/write barriers falls within this class as well, but it is interesting to note that we found no bugs of this form.
7. TM ordering (TM-order). This class of errors represents attempts by the programmer to follow some sort of locking discipline in the presence of transactions, where it is strictly unnecessary. Such errors do not result in an incorrect program, but do represent a misunderstanding of the primitive.
8. Forgotten TM synchronization (TM-forgot). Like the forgotten synchronization class above (lock-forgot), these errors occur when a programmer failed to recognize the need for synchronization and did not use transactions to protect a data structure.
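Two of the most common lock-based error classes in this taxonomy, lock-cond and cv-use, have compact code shapes. The sketch below is in Python rather than the Java of sync-gallery, and the Gallery fields and method names are hypothetical, invented purely to illustrate the buggy and correct patterns side by side.

```python
import threading

class Gallery:
    """Hypothetical stand-in for the shared sync-gallery state."""
    def __init__(self):
        self.lock = threading.Lock()
        self.cond = threading.Condition(self.lock)
        self.shots = 0       # number of used (dirty) lanes
        self.capacity = 4

    # lock-cond bug: the condition is checked with no lock held, then
    # acted on after acquiring the lock; another thread may have
    # cleaned (or dirtied) the gallery in between.
    def clean_buggy(self):
        if self.shots == self.capacity:      # unprotected check
            with self.lock:
                self.shots = 0               # state may have changed

    # Correct: check and act inside a single critical section.
    def clean_correct(self):
        with self.lock:
            if self.shots == self.capacity:
                self.shots = 0

    # cv-use bug: `if` checks the wait condition only once, so a
    # spurious wakeup, or another thread cleaning first, leaves the
    # condition unchecked when wait() returns.
    def await_full_buggy(self):
        with self.cond:
            if self.shots < self.capacity:   # should be `while`
                self.cond.wait()

    # Correct: re-check the condition in a loop after every wakeup.
    def await_full_correct(self):
        with self.cond:
            while self.shots < self.capacity:
                self.cond.wait()

    def shoot(self):
        with self.cond:
            self.shots += 1
            if self.shots == self.capacity:
                self.cond.notify_all()
```

Under a single thread the buggy variants behave identically to the correct ones, which is why such errors survive casual testing and only surface under interleavings.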
Table 3 shows the characterization of synchronization errors for programs submitted in year 2. Figure 6 shows the overall portion of students that made an error on each programming task. Students were far more likely to make an error on fine-grain synchronization than on coarse or TM.
Table 3: Synchronization error rates for year 2. The occurrences row indicates the number of programs in which at least one bug of the type indicated by the column header occurred. The opportunities row indicates the sample size (the number of programs we examined in which that type of bug could arise; e.g., lock-ordering bugs cannot occur with a single coarse lock). The rate column expresses the percentage of examined programs containing that type of bug. Bug types are explained in Section 4.3.
Figure 6: Overall error rates for programming tasks. Error bars show a 95% confidence interval on the error rate. Fine-grained locking tasks were more likely to contain errors than coarse-grained or transactional memory (TM) tasks.
About 70% of students made at least one error on the Fine and Fine2 portions of the assignment.
5 Related work
Hardware transactional memory is an active research field with many competing proposals [4–7, 9–11, 15–17, 19–23, 26]. All this research on hardware mechanisms is the cart leading the horse if researchers never validate the assumption that transactional programming is actually easier than lock-based programming.
This research uses software transactional memory (of which there is no shortage of proposals [3, 12–14, 18, 25]), but its purpose is to evaluate how untrained programmers learn to write correct and performant concurrent programs with locks and transactions. The programming interface for STM systems is the same as for HTM systems, except that without compiler support, STM implementations require explicit read-write barriers, which are not required in an HTM. Compiler integration is easier to program against than using a TM library [8]. Future research could investigate whether compiler integration lowers the perceived programmer difficulty of using transactions.
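The barrier distinction is concrete: in a library-based STM, every shared read and write must go through an explicit call, which is precisely the step a programmer can omit. The sketch below uses a hypothetical TVar/Tx API (not DSTM2's actual interface) and elides conflict detection, versioning, and retry entirely.

```python
# Hypothetical library-based STM interface; a real STM adds conflict
# detection, versioning, and retry, all omitted from this sketch.
class TVar:
    def __init__(self, value):
        self.value = value

class Tx:
    def __init__(self):
        self.write_set = {}           # buffered (deferred) updates
    def read(self, tvar):             # explicit read barrier
        return self.write_set.get(tvar, tvar.value)
    def write(self, tvar, value):     # explicit write barrier
        self.write_set[tvar] = value
    def commit(self):                 # publish buffered writes
        for tvar, value in self.write_set.items():
            tvar.value = value

balance = TVar(100)
tx = Tx()
tx.write(balance, tx.read(balance) - 30)  # every access is a call
tx.commit()
```

With compiler-integrated TM the transaction body would read like ordinary code, e.g. an atomic block containing a plain `balance -= 30`, and the barriers would be inserted automatically; forgetting a barrier is then impossible by construction.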
6 Conclusion
To our knowledge, no previous work directly addresses the question of whether transactional memory actually delivers on its promise of being easier to use than locks. This paper offers evidence that transactional programming really is less error-prone than high-performance locking, even if novice programmers have some trouble understanding transactions. Students' subjective evaluations showed that they found transactional memory slightly harder to use than coarse locks, and easier to use than fine-grain locks and condition synchronization. However, analysis of synchronization error rates in students' code yields a more dramatic result, showing that for similar programming tasks, transactions are considerably easier to get correct than locks.
References

[3] A.-R. Adl-Tabatabai, B. Lewis, V. Menon, B. Murphy, B. Saha, and T. Shpeisman. Compiler and runtime support for efficient software transactional memory. In PLDI, Jun 2006.

[4] Lee Baugh, Naveen Neelakantam, and Craig Zilles. Using hardware memory protection to build a high-performance, strongly atomic hybrid transactional memory. In Proceedings of the 35th Annual International Symposium on Computer Architecture, Jun 2008.

[5] Colin Blundell, Joe Devietti, E. Christopher Lewis, and Milo M. K. Martin. Making the fast case common and the uncommon case simple in unbounded transactional memory. SIGARCH Comput. Archit. News, 35(2):24–34, 2007.

[6] Jayaram Bobba, Neelam Goyal, Mark D. Hill, Michael M. Swift, and David A. Wood. TokenTM: Efficient execution of large transactions with hardware transactional memory. In Proceedings of the 35th Annual International Symposium on Computer Architecture, Jun 2008.

[7] J. Chung, C. Minh, A. McDonald, T. Skare, H. Chafi, B. Carlstrom, C. Kozyrakis, and K. Olukotun. Tradeoffs in transactional memory virtualization. In ASPLOS, 2006.

[8] Luke Dalessandro, Virendra J. Marathe, Michael F. Spear, and Michael L. Scott. Capabilities and limitations of library-based software transactional memory in C++. In Proceedings of the 2nd ACM SIGPLAN Workshop on Transactional Computing, Portland, OR, Aug 2007.

[9] L. Yen et al. LogTM-SE: Decoupling hardware transactional memory from caches. In HPCA, 2007.

[10] Mark Moir et al. Experiences with a commercial processor supporting HTM. In ASPLOS, 2009.

[11] L. Hammond, V. Wong, M. Chen, B. Hertzberg, B. Carlstrom, M. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun. Transactional memory coherence and consistency. In ISCA, 2004.

[12] T. Harris, M. Plesko, A. Shinnar, and D. Tarditi. Optimizing memory transactions. In PLDI, Jun 2006.

[13] Tim Harris and Keir Fraser. Language support for lightweight transactions. In OOPSLA, pages 388–402, Oct 2003.

[14] M. Herlihy, V. Luchangco, and M. Moir. A flexible framework for implementing software transactional memory. In OOPSLA, pages 253–262, 2006.

[15] M. Herlihy and J. E. Moss. Transactional memory: Architectural support for lock-free data structures. In ISCA, May 1993.

[16] Owen S. Hofmann, Christopher J. Rossbach, and Emmett Witchel. Maximal benefit from a minimal TM. In ASPLOS, 2009.

[17] Yossi Lev and Jan-Willem Maessen. Split hardware transactions: True nesting of transactions using best-effort hardware transactional memory. In PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 197–206, New York, NY, USA, 2008. ACM.

[18] V. Marathe, M. Spear, C. Heriot, A. Acharya, D. Eisenstat, W. Scherer III, and M. Scott. Lowering the overhead of nonblocking software transactional memory. In TRANSACT, 2006.

[19] A. McDonald, J. Chung, B. Carlstrom, C. Minh, H. Chafi, C. Kozyrakis, and K. Olukotun. Architectural semantics for practical transactional memory. In ISCA, Jun 2006.

[20] K. E. Moore, J. Bobba, M. J. Moravan, M. D. Hill, and D. A. Wood. LogTM: Log-based transactional memory. In HPCA, 2006.

[21] R. Rajwar, M. Herlihy, and K. Lai. Virtualizing transactional memory. In ISCA, Jun 2005.

[22] H. Ramadan, C. Rossbach, D. Porter, O. Hofmann, A. Bhandari, and E. Witchel. MetaTM/TxLinux: Transactional memory for an operating system. In ISCA, 2007.

[23] H. Ramadan, C. Rossbach, and E. Witchel. Dependence-aware transactional memory for increased concurrency. In MICRO, 2008.

[24] Hany E. Ramadan, Indrajit Roy, Maurice Herlihy, and Emmett Witchel. Committing conflicting transactions in an STM. In PPoPP, 2009.

[25] Nir Shavit and Dan Touitou. Software transactional memory. In Proceedings of the 14th ACM Symposium on Principles of Distributed Computing, pages 204–213, Aug 1995.

[26] Arrvindh Shriraman, Sandhya Dwarkadas, and Michael L. Scott. Flexible decoupled transactional memory support. In Proceedings of the 35th Annual International Symposium on Computer Architecture, Jun 2008.