Page 1: Regulating Concurrency in Software Transactional Memory

Regulating Concurrency in Software Transactional Memory: An Effective Model-based Approach

Pierangelo Di Sanzo, Francesco Del Re, Diego Rughetti, Bruno Ciciani, Francesco Quaglia
DIAG, Sapienza University of Rome

Abstract—Software Transactional Memory (STM) is recognized as an effective programming paradigm for concurrent applications. On the other hand, a core problem to cope with in STM deals with (dynamically) regulating the degree of concurrency, in order to deliver optimal performance. We address this problem by proposing a self-regulation approach of the concurrency level, which relies on a parametric analytical performance model aimed at predicting the scalability of the STM application as a function of the actual workload profile. The regulation scheme allows achieving optimal performance during the whole lifetime of the application via dynamic change of the number of concurrent threads according to the predictions by the model. The latter is customized for a specific application/platform through regression analysis, which is based on a lightweight sampling phase. We also present a real implementation of the model-based concurrency self-regulation architecture integrated within the open source TinySTM framework, and an experimental study based on standard STM benchmark applications.

I. INTRODUCTION

Software Transactional Memory (STM) [17] has emerged as a promising paradigm aiming at simplifying the development of parallel/concurrent applications. By relying on the concept of atomic transaction, STM represents a friendly alternative to traditional lock-based synchronization. More in detail, code blocks accessing shared data can be marked as transactions, thus demanding coherency of the data access/manipulation to the STM layer, rather than to any handcrafted synchronization scheme. The relevance of the STM paradigm has significantly grown given that multi-core systems have become mainstream platforms, so that even entry-level desktop and laptop machines are nowadays equipped with multiple processors and/or CPU-cores.

Even though one main target for STM is the simplification of the software development process, another aspect that is central for the success of the STM paradigm relates to the actual level of performance it can deliver. As for this aspect, STM needs to be complemented by schemes aimed at allowing the overlying application to reach optimal speedup values thanks to fruitful parallelism exploitation. This issue arises since STM applications are prone to thrashing phenomena (caused by excessive transaction rollbacks) in

This work has been partially supported by the Cloud-TM project (co-financed by the European Commission through the contract no. 57784) and by COST Action IC1001 EuroTM. The software produced as a result of the research activity presented in this paper can be downloaded at the URL http://www.dis.uniroma1.it/~hpdcs/CSR-STM.tar

case the data access pattern tends to exhibit non-negligible conflict among concurrent transactions and the degree of parallelism in the execution is excessively high. On the other hand, for too low parallelism levels, the achievable speedup may still be suboptimal.

Recent approaches coping with this problem have been targeted at selecting/controlling the degree of parallelism by (dynamically) determining the well-suited number of concurrent threads to sustain application execution. Along this path we can find solutions ranging from analytical models [13], [4], to heuristic-based schemes [7], to machine learning approaches [14]. On the other hand, all of the proposed approaches exhibit some shortcoming. Classical analytical approaches are in fact known to become unreliable as soon as the assumptions they rely on (e.g. in terms of data access distribution and/or distribution of the CPU time for specific operations) are not met. Further, according to the outcomes in [6], the transaction abort rate can be strongly affected by the order according to which data are accessed along the transaction execution path, which is typically neglected by analytical models. On the other hand, even in case the effects of such an ordering are captured analytically, the actual exploitation of the performance model would require detailed knowledge of the data access pattern for the specific application, which may be unavailable or arduous to build. As for heuristic and/or machine learning approaches, they do not require specific (stringent) assumptions to be met in relation to, e.g., the transactional profile of the application. Hence, they exhibit the potential for high effectiveness in generic application contexts, and for generic computing platforms. On the other hand, these approaches may show limited extrapolation capabilities, thus not being fully suited for forecasting the performance that would be achieved with levels of concurrency not belonging to the already explored domain (e.g. the training domain in case of neural network based approaches). Further, the time required for building the knowledge base to be exploited by the machine learner may be non-minimal, which may make the actuation of the optimized concurrency configuration untimely.

In this article we tackle the issue of regulating the concurrency level in STM via a model-based approach, which differentiates from classical ones in that it avoids the need for the STM system to meet specific assumptions (e.g. in terms of data access pattern). Our proposal relies on a parametric analytical expression capturing the expected trend in the transaction abort probability (versus the degree of concurrency) as a function of a set of features associated with the actual workload profile. The parameters appearing within the model exactly aim at capturing execution dynamics and effects that are hard to express through classical (non-parametric) analytical modeling approaches. We derived the parametric expression of the transaction abort probability via combined exploitation of literature results in the field of analytical modeling and a simulation-based analysis. Further, the parametric model is designed to be easily customizable for a specific STM system by calculating the values to be assigned to the parameters (hence by instantiating the parameters) via regression analysis. The latter can be performed by exploiting a set of sampling data gathered through run-time observations of the STM application. However, differently from what happens for the training process in machine learning approaches, the actual sampling phase (needed to provide the knowledge base for regression in our approach) is very lightweight. Specifically, a very limited number of profiling samples, related to a few different concurrency levels for the STM system, likely suffices for successful instantiation of the model parameters via regression. Finally, our approach inherits the extrapolation capabilities typical of pure analytical models (although it does not require their typical stringent assumptions to be met, as already pointed out), hence allowing reliable performance forecasts even for concurrency levels distant from the ones for which sampling was carried out.

A set of experimental results obtained by running the STAMP benchmark suite [2] on top of the TinySTM open source framework [11] is reported for validating the proposed modeling approach. Further, we present the implementation of a concurrency self-regulating STM, exploiting the proposed performance model, still relying on TinySTM as the core STM layer, and we report experimental data for an assessment of this architecture.

The remainder of this paper is organized as follows. In Section II, literature results related to our proposal are discussed. Section III is devoted to describing and validating our STM performance model. The STM architecture entailing self-regulation capabilities of the concurrency level is presented and evaluated in Section IV.

II. RELATED WORK

Our proposal has relations with literature results in the field of analytical modeling of concurrency control protocols for transactional systems. These include performance models for traditional database systems and related concurrency control mechanisms (see, e.g., [18], [22]) and approaches specifically targeting STM (see, e.g., [4]). Some of the literature analytical models rely on (stringent) assumptions on the transaction data access pattern, such as uniformly distributed accesses (e.g. [22], [10]) or the b-c access model (e.g. [18], [19]). Differently from all these works, our proposal does not assume any specific distribution for the data accesses, thus being more general and exploitable in generic application contexts. Other literature models are able to capture more complex data access patterns by assuming Zipf-distributed accesses [5] or phase-based accesses [6]. Compared to these solutions, our proposal avoids the need for any detailed characterization of the data access distribution. As a consequence, the instantiation of the parameters appearing in our model requires a lighter application sampling process than what is required to instantiate the actual data access distribution.

In [8] the authors propose a technique to approximate the performance of the STM application when considering different amounts of concurrent threads. The technique is based on the usage of different types of functions, such as polynomial, rational and logarithmic functions. The approximation process relies on sampling the speed-up of the application over a set of runs, each one executed with a different number of concurrent threads. Afterwards, the speed-up forecasting function is instantiated by interpolating the measurements. Compared to our proposal, a limitation of this approach lies in the fact that the workload profile of the application is not taken into account while instantiating the performance forecasting function. This may lead to reduced reliability of the forecasting outcome, especially when the workload profile of the application changes.

As for machine learning, it has been used in [20] for selecting the best performing conflict detection and management algorithm. Conversely, it has been used in [3] to select suitable mappings of threads to CPU-cores, allowing performance improvements thanks to increased effectiveness of the caching system. The goal of both these works is different and orthogonal with respect to ours, since we focus on the regulation of the overall concurrency level in the STM system. To the best of our knowledge, the only machine learning based approach targeting this same problem has been presented in [14]. Compared to this solution, our proposal relies on a sampling process that is lighter than the one required for building the machine learning based performance model via training.

In [7] a black-box approach is proposed, based on the hill-climbing heuristic scheme, which dynamically increases or decreases the level of concurrency. Particularly, the approach determines whether the trend of increasing/decreasing the concurrency level has positive effects on the STM throughput, in which case the trend is maintained. Differently from our proposal, no direct attempt to capture the relation between the actual transaction profile and the achievable performance (depending on the level of parallelism) is made.

Given that our model-based approach is ultimately aimed at regulating concurrency so as to avoid thrashing phenomena, our proposal is related to pro-active transaction scheduling schemes, which cope with the issue of performance degradation due to excessive data contention [1], [21], [9]. These solutions avoid scheduling the execution of transactions whose associated conflict probability is estimated to be high. The work in [1] presents a control algorithm that dynamically changes the number of threads concurrently executing transactions on the basis of the observed transaction conflict rate (by decreasing/increasing the level of concurrency when the conflict rate exceeds/falls below some threshold). In [21], incoming transactions are enqueued and sequentialized when an indicator, referred to as contention intensity (calculated as a dynamic average depending on the number of aborted vs committed transactions), exceeds a pre-determined threshold. In [9], a transaction is sequentialized when a potential conflict with other running transactions is predicted. The prediction relies on the estimation of the expected transaction read-set and write-set. The sequentializing mechanism is activated only when the amount of aborted vs committed transactions exceeds a given threshold. Compared to our model-based approach, none of the above proposals directly estimates the likelihood of transaction aborts as a function of the level of concurrency. Rather, they attempt to control the wasted time in an indirect manner via heuristics.

III. THE PARAMETRIC PERFORMANCE MODEL

As already hinted, we decided to exploit a model relying on a parametric analytical expression which captures the expected trend of the transaction abort probability as a function of (1) a set of features characterizing the current workload profile, and (2) the number of concurrent threads sustaining the STM application. The parameters in the analytical expression aim at capturing effects that are hard to express through a classical (non-parametric) analytical modeling approach. Further, they are exploited to customize the model for a specific STM application through regression analysis, which is done by exploiting a set of sampling data gathered through run-time observations of the application. In the remainder of this section we provide the basic assumptions on the behavior of the STM application, which are exploited while building the parametric analytical model. Then the actual construction of the model is presented, together with a model validation study.

A. Basic Assumptions

The STM application is assumed to be run with a number k of concurrent threads. The execution flow of each thread is characterized by the interleaving of transactions and non-transactional code (ntc) blocks. This is the typical structure of common STM applications, which also reflects the one of widely diffused STM benchmarks (see, e.g., [2]). The transaction read-set (write-set) is the set of shared data objects that are read (written) by the thread while running a transaction. If a conflict between two concurrent transactions occurs, then one of the conflicting transactions is aborted and re-started (which leads to a new transaction run). After the thread commits a transaction, it executes an ntc block, which ends before the execution of the begin operation of the subsequent transaction along the same thread.
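As a minimal sketch of these assumptions, the per-thread execution flow can be mimicked in a few lines of Python. The function and parameter names below are ours, purely illustrative; a fixed abort probability stands in for the contention dynamics that the model in the next subsection actually captures.

```python
import random

def run_thread(n_transactions, p_abort, seed=42):
    """Simulate one thread's flow under the stated assumptions: each
    transaction is retried until it commits (every abort leads to a new
    run of the same transaction), and each commit is followed by an
    ntc block."""
    rng = random.Random(seed)
    total_runs = 0   # committed plus aborted transaction runs
    committed = 0
    while committed < n_transactions:
        total_runs += 1
        if rng.random() < p_abort:
            continue          # abort: re-start the transaction
        committed += 1        # commit; the ntc block would execute here
    return total_runs, committed
```

With a constant abort probability pa, the expected number of runs per committed transaction is 1/(1 − pa), which is the kind of wasted-work amplification the concurrency-regulation scheme aims to keep in check.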

B. Model Construction

The set P of features exploited for the construction of the parametric analytical model, which are used to capture the workload profile, consists of:

• the average size of the transaction read-set rss;
• the average size of the transaction write-set wss;
• the average execution time tt of committed transaction runs (i.e. the average duration of transaction runs that are not aborted);
• the average execution time tntc of ntc blocks;
• the read/write affinity rwa, namely the probability that an object read by a transaction is also written by other transactions;
• the write/write affinity wwa, namely the probability that an object written by a transaction is also written by other transactions.

Operatively, rwa can be calculated as the dot product between the distribution of read operations and the distribution of write operations (both expressed in terms of relative frequency of accesses to shared data objects). Similarly, wwa can be calculated as the dot product between the distribution of write operations and itself.
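The two dot products can be sketched as follows; the four-object access distributions are assumed, purely illustrative values (each must sum to 1):

```python
import numpy as np

# Illustrative relative access frequencies over 4 shared data objects
read_freq  = np.array([0.40, 0.30, 0.20, 0.10])
write_freq = np.array([0.10, 0.20, 0.30, 0.40])

rwa = float(np.dot(read_freq, write_freq))   # read/write affinity
wwa = float(np.dot(write_freq, write_freq))  # write/write affinity
```

Note that wwa equals the sum of the squared write frequencies, so skewed write distributions yield a higher write/write affinity than uniform ones.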

Our parametric analytical model expresses the transaction abort probability pa as a function of the features belonging to the set P, and the number k of concurrent threads supposed to run the STM application. Specifically, it instantiates (in a parametric manner) the function

pa = f(rss, wss, rwa, wwa, tt, tntc, k) (1)

Leveraging literature models proposing approximated performance analysis for transaction processing systems (see [22], [16]), we express the transaction abort probability pa through the function

pa = 1 − e^(−α) (2)

However, while in literature results the parameter α is expressed as the multiplication of parameters directly representing, e.g., the data access pattern and the workload intensity (such as the transaction arrival rate λ for the case of open systems), in our approach we express α as the multiplication of different functions that depend on the set of features appearing in equation (1). Overall, our expression for pa is structured as follows

pa = 1 − e^(−ρ·ω·ϕ) (3)

where the function ρ is assumed to depend on the input parameters rss, wss, rwa and wwa, the function ω is assumed to depend on the parameter k, and the function ϕ is assumed to depend on the parameters tt and tntc.
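Equation (3) can be transcribed directly once ρ, ω and ϕ have been instantiated; the function name below is ours:

```python
import math

def p_abort(rho, omega, phi):
    # Equation (3): pa = 1 - e^(-rho * omega * phi)
    return 1.0 - math.exp(-rho * omega * phi)
```

By construction, pa = 0 whenever any factor is zero (e.g. no data contention, or a single thread), and pa grows monotonically towards 1 as the product ρ·ω·ϕ increases.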

We note that equation (2) has been derived in the literature while modeling the abort probability for the case of optimistic concurrency control schemes, where transactions are aborted (and restarted) right upon conflict detection. Consequently, this expression for pa and the variation we propose in equation (3) are expected to well match the STM context, where pessimistic concurrency control schemes (where transactions can experience lengthy lock-wait phases upon conflicting) are not used, since they would limit the exploitation of parallelism in the underlying architecture. More specifically, in typical STM implementations (see, e.g., [11]), transactions are immediately aborted right upon executing an invalid read operation. Further, they are aborted on write-lock conflicts either immediately or after a very short wait-time.

The model we propose in equation (3) is parametric thanks to expressing α as the multiplication of parametric functions that depend on a simple and concise representation of the workload profile (via the features in the set P) and on the level of parallelism. This provides it with the ability to capture variations of the abort probability (e.g. vs the degree of parallelism) for differentiated application profiles. Particularly, different applications may exhibit similar values for the featuring parameters in the set P, but may anyhow exhibit different dynamics, leading to a different curve for pa while varying the degree of parallelism. This can be captured by our model via application-specific instantiation of the parameters characterizing the functions ρ, ω and ϕ, which can be done through regression analysis. In the next section we discuss how we have derived the actual ρ, ω and ϕ functions, hence the actual function expressing α.

C. Instantiating ρ, ω and ϕ

The shape of the functions ρ, ω and ϕ determining α is derived in our approach by exploiting the results of a simulation study. We decided to rely on simulation, rather than using measurements from real systems, since our model is aimed at capturing the effects associated with data contention on the abort probability, while it is not targeted at capturing the effects of thread contention on hardware resources. Consequently, the instantiation of the functions appearing within the model has been based on an "ideal hardware" simulation model showing no contention effects. In any case, when exploiting our data contention model for concurrency regulation in a real system, a hardware scalability model (e.g. a queuing network-based model) can be used to estimate variations of the processing time, due to contention effects on shared hardware resources, as a function of the number of concurrent threads. In the final part of this paper, we provide some results that have been achieved by exactly using our data contention model and a hardware scalability model in a joint fashion.

The simulation framework we have exploited in this study is the same used in [4] for validating an analytical performance model for STM. It relies on the discrete-event paradigm, and the implemented model simulates a closed system with k concurrent threads, each one alternating the execution of transactions and ntc blocks. The simulated concurrency control algorithm is the default algorithm of TinySTM (encounter-time locking for write operations and timestamp-based read validation). A transaction starts with a begin operation, then it interleaves the execution of read/write operations (accessing a set of shared data objects) and local computation phases, and, finally, executes a commit operation. The durations of ntc blocks, transactional operations and local computation phases are exponentially distributed.

In the simulation runs we performed to derive and validate the expression of α, we varied rss and wss between 0 and 200, rwa and wwa between 25·10^−6 and 0.01, tt between 10 and 150 µsec, and tntc between 0 and 15·10^4 µsec. These intervals include values that are typical for the execution of STM benchmarks such as [2], hence being representative of workload features that can be expected in real execution contexts. Further, we varied k between 2 and 64 in the simulations. Due to space constraints, we do not explicitly show all the achieved simulation results. However, the results we show are a significant, although concise, representation of the whole set of achieved results.

The construction of the analytical expressions for ρ, ω and ϕ has been based on an incremental approach. Particularly, we first derive the expression of ρ by analyzing simulation results while varying the workload configuration parameters affecting it, i.e. rss, wss, rwa, wwa, and keeping the other parameters fixed. Then, we calculate the values of ρ from the ones achieved for pa via simulation, which is done by using the inverse function ρ = f^−1(pa), once ω = 1 and ϕ = 1 have been set. After having identified a parametric fitting function for ρ, we derive the expression of ω via the analysis of the simulation results achieved while also varying k. Hence, we calculate ω = f^−1(pa), where we use for ρ the previously identified expression, and where we set ϕ = 1. Then, we select a parametric fitting function for ω. Finally, we use the same approach to derive the expression of ϕ, which is done by exploiting the simulation results achieved while varying all the workload profile parameters and the level of concurrency k, thus calculating ϕ = f^−1(pa), where we use for ρ and ω the previously chosen expressions.
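The inversion steps of this incremental procedure can be sketched as follows (function names are ours): with ω = ϕ = 1, equation (3) gives ρ = −ln(1 − pa), and once a fitted expression for ρ is available, the same inversion isolates ω.

```python
import math

def rho_from_pa(pa):
    # Step 1: invert equation (3) with omega = phi = 1,
    # so that rho = -ln(1 - pa)
    return -math.log(1.0 - pa)

def omega_from_pa(pa, rho):
    # Step 2: invert again once the fitted rho is available
    # (phi still fixed to 1): omega = -ln(1 - pa) / rho
    return -math.log(1.0 - pa) / rho
```

Each step turns simulated abort probabilities into samples of one factor, which are then fed to the regression that selects that factor's parametric fitting function.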

To derive the expression of ρ, we initially analyzed via simulation the relation between the values of pa and the values of the parameters wss and wwa. In Figure 1 we provide some results showing the values of ρ as calculated through the f^−1(pa) inverse function (as described above) by relying on simulation data as the input. The data refer to variations of wwa and to 3 different values of wss, while all the other parameters have been kept fixed. We note that ρ appears to have a logarithmic shape. Additionally, in order to choose a parametric function fitting the calculated values of ρ, we need to consider that if wwa = 0 then pa = 0. In fact, no data contention ever arises in case of no write operations within the transactional profile (which implies ρ = 0). Thus, we approximated the dependency of ρ on wwa through the following parametric logarithmic function

c · ln(a · wwa + 1) (4)

where a and c are the fitting parameters. The presence of the +1 term in expression (4) is due to the above-mentioned constraint according to which wwa = 0 implies ρ = 0.
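The regression step for expression (4) can be sketched with SciPy's `curve_fit`; here synthetic samples generated from assumed parameter values (a = 500, c = 0.8, purely illustrative) stand in for the ρ values extracted from the simulation data.

```python
import numpy as np
from scipy.optimize import curve_fit

def rho_vs_wwa(wwa, a, c):
    # Expression (4): c * ln(a * wwa + 1); the "+1" guarantees
    # rho = 0 when wwa = 0 (no write/write contention)
    return c * np.log(a * wwa + 1.0)

# Synthetic rho samples over the wwa range explored in the simulations
wwa_samples = np.linspace(25e-6, 0.01, 50)
rho_samples = rho_vs_wwa(wwa_samples, 500.0, 0.8)

(a_fit, c_fit), _ = curve_fit(rho_vs_wwa, wwa_samples, rho_samples,
                              p0=(100.0, 1.0))
```

In the actual procedure the input pairs would come from the f^−1(pa) inversion applied to simulation (or profiling) data rather than from a known ground truth.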

Afterwards, we also considered the effects of the parameter wss on ρ. To this aim, in Figure 2 we report the values of ρ, derived from the simulation results, while varying wss and for 3 different values of wwa. We note the presence of an inflection point. Therefore, in this case, we approximated the dependency of ρ on wss by using the function

e · (ln(b · wss + 1))^d (5)

where b, d and e are fitting parameters, d being the one capturing the inflection. Assuming that the effects on the transaction abort probability are multiplicative with respect to wwa and wss (which is aligned with what literature models state in terms of the proportionality of the abort probability with respect to the multiplication of the conflict probability and the number of operations, see, e.g., [22]), we achieved the following parametric expression of ρ (vs wwa and wss), where d has been used as the exponent also for expression (4) in order to capture the effects of shifts of the inflection point caused by variations of wwa (as shown by the plots in Figure 2 relying on simulation)

[c · (ln(b · wss + 1)) · ln(a · wwa + 1)]^d (6)

where we collapsed the original parameters c and e into one single parameter c. We validated the accuracy of expression (6) via comparison with values achieved through a set of simulations, where we used different workload profiles. The parameters appearing in expression (6) have been calculated through regression analysis. Specifically, for each test, we based the regression analysis on 40 randomly selected workload profiles achieved while varying wwa and wss. Then, we measured the average error between the transaction abort probability evaluated via simulation and the one predicted using for ρ the function in expression (6) for a set of 80 randomly selected workload profiles. As an example, in Figure 6, we depict results for the case with k = 8. Along the x-axis, workload profiles are identified by integer numbers and are ordered by the values of wss and wwa. The measured average error in all the tests was 5.3%.

Subsequently, we considered the effects on the transaction abort probability caused by read operations. Thus, we analyzed the relation between pa and the parameters rss, rwa and wss. The parameter wss is included since contention on transactional read operations is affected by the amount of write operations by concurrent transactions. In Figure 4 we report simulation results showing the values of ρ while varying rss and for 3 different values of rwa. In Figure 5, we report values of ρ achieved while varying rwa and for 3 different values of rss. We note that the shapes of the

curves are similar to the above cases, where we analyzedthe relation between pa and the parameters wwa and wss.Thus, using a similar approach, and considering that pa isalso proportional to wss, we approximate the dependencyof ρ on rwa, wss and wwa using the following function

[e · ln(f · rwa + 1) · ln(g · rss + 1) · wss]^z (7)

where e, f, g and z are the fitting parameters. The final expression for ρ is then derived by summing expressions (6) and (7). The intuitive motivation is that, when adding read operations within a transaction, the likelihood of abort due to conflicts on the original write operations does not change. However, the added operations lead to an increase of the overall abort probability, which we capture by summing the two expressions. Also in this case, we validated the final expression for ρ via comparison with the values achieved through a set of simulations, where we varied the workload profile. Similarly to what was done before, the regression analysis has been based on 40 workload profiles, while the comparison has been based on 80 workload profiles, all selected by randomly varying wwa, wss, rwa, rss. The results for k = 8 are reported in Figure 6. Along the x-axis, workload profiles are ordered by values of rss, rwa, wss and wwa. The average error we measured in all the tests was 2.7%.

Subsequently, in order to build the expression for ω, we considered the effects of the number of concurrent threads, namely the parameter k, on the abort probability. On the basis of simulation results, some of which are reported in Figure 7, we decided also in this case to use a parametric logarithmic function as the approximation curve of ω vs k. Clearly, the constraint that ω = 0 if k = 1 needs to be accounted for (since the absence of concurrency cannot give rise to transaction aborts). Thus, we approximate ω as

h · ln(l · (k − 1) + 1), (8)

where h and l are the fitting parameters. Again, we validated the resulting function for pa, which depends on ω (and hence on the modeled effects of the variation of k), using the same amount of workload profiles as in the previous studies, still selected by randomly varying wwa, wss, rwa, rss and k. Some results are depicted in Figure 8 for variations of k between 1 and 64. The average error we measured in all the tests was 2.1%.
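As a sketch (with hypothetical parameter values and synthetic samples in place of the simulation output), fitting expression (8) preserves the constraint ω = 0 for k = 1 by construction, since the argument of the logarithm reduces to 1:

```python
import numpy as np
from scipy.optimize import curve_fit

def omega(k, h, l):
    # Expression (8): omega vs the number of concurrent threads k;
    # at k = 1 the logarithm's argument is 1, so omega(1) = 0.
    return h * np.log(l * (k - 1.0) + 1.0)

k = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0])
w = omega(k, 2.5, 0.4)               # synthetic samples (hypothetical h, l)

(h, l), _ = curve_fit(omega, k, w, p0=(1.0, 1.0))
```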

Finally, we built the expression of ϕ, which depends on tt and tntc. To this aim, we note that if tt = 0 (which represents the unrealistic case where transactions are executed instantaneously) then ϕ must be equal to 0 (given that the likelihood of concurrent transactions is zero). Additionally, we note that tt can be seen as the duration of a vulnerability window during which the transaction is subject to being aborted. For longer fractions of time during which transactions are vulnerable, a higher probability of actual transaction aborts

Figure 1: Variation of ρ with respect to the write-write affinity (ρ vs wwa, curves for wss = 10, 30, 100).

Figure 2: Variation of ρ with respect to the write-set size (ρ vs wss, curves for wwa = 0.001, 0.005, 0.01).

Figure 3: Simulated vs predicted abort probability while varying wwa and wss (abort probability vs workload profile id).

Figure 4: Variation of ρ with respect to the read-write affinity (ρ vs rwa, curves for rss = 10, 30, 100).

Figure 5: Variation of ρ with respect to the read-set size (ρ vs rss, curves for rwa = 0.000025, 0.00003, 0.0001).

Figure 6: Simulated vs predicted abort probability vs rwa, rss, wwa, wss.

Figure 7: Variation of ω vs the number of concurrent threads (curves for different combinations of rss, wss, wwa, rwa).

Figure 8: Simulated vs predicted abort probability vs k, rwa, rss, wwa, wss.

Figure 9: Variation of ϕ with respect to θ (curves for different combinations of rss, wss, wwa, rwa and k).

can be expected. Thus we assume ϕ to be proportional to

θ = tt / (tt + tntc) (9)

We analyzed through simulation the relation between ϕ and θ. Some results are shown in Figure 9, on the basis of which we decided to approximate ϕ using the function:

m · ln(n · θ + 1) (10)

where m and n are the fitting parameters. The expression of pa in equation (3) is now fully defined.
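A one-line check (with hypothetical values for m and n) confirms that this functional form satisfies the boundary condition discussed above, i.e. ϕ = 0 whenever tt = 0:

```python
import math

def phi(tt, tntc, m, n):
    # Expressions (9)-(10): theta = tt / (tt + tntc), phi = m * ln(n * theta + 1).
    theta = tt / (tt + tntc)
    return m * math.log(n * theta + 1.0)

zero = phi(0.0, 1.0, m=1.4, n=3.0)   # tt = 0 -> theta = 0 -> phi = 0
```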

To validate it, we used the same approach adopted for the validation of each of the aforementioned incremental steps. Some results, based on randomly selected workload profiles, are shown in Figure 10. In all our tests, we measured an average relative error of 4.8%.

D. Model Validation with Respect to a Real System

As a further validation step, we compared the output of the proposed model with real measurements taken by running applications belonging to the STAMP benchmark suite [2] on top of the open source TinySTM framework [11]. Additionally, we evaluated the ability of the model to provide accurate predictions while varying the amount of samples used to perform the regression analysis, gathered through observations of the behavior of the real system. Particularly, we evaluated the extrapolation capability of the model, namely its ability to forecast the transaction abort probability

Figure 10: Simulated vs predicted abort probability while varying all the workload profile parameters.

that would be achieved when running the STM application with concurrency levels (numbers of threads) not included in the observed domain where the regression samples were taken.

The presented results refer to three different benchmark applications of the STAMP suite, namely Kmeans, Yada and Vacation. As shown in [2], these applications are characterized by quite different workload profiles. This allowed us to evaluate the model accuracy with respect to a relatively wide workload configuration domain. All the tests have been performed on top of an HP ProLiant server equipped with two AMD Opteron 6128 Series processors, each one having eight CPU-cores (for a total of 16 cores), and 32 GB of RAM, running Linux (kernel version 2.6.32-5-amd64).

For each application, we performed regression analysis to calculate three different sets of values for the model parameters, hence instantiating three models relying on the proposed parametric analysis. Each regression has been performed using one of three different sets of measurements, each set including 80 samples. The first set included samples gathered by observing the application running with 2 and 4 concurrent threads. The second one included samples gathered by observing the application running with 2, 4 and 8 concurrent threads. Finally, the third one included samples gathered by observing the application running with 2, 4, 8 and 16 concurrent threads. This allowed us to evaluate the extrapolation ability of the model, with respect to the number of concurrent threads, while observing the application for limited amounts of concurrency levels (say for 2, 3 or 4 different levels of concurrency). We performed, for each application, the following tests. After setting up the model instances, we executed a set of runs of the application using different values for the application input parameters (leading the same application to run with somewhat different workload profiles) and with a number of concurrent threads spanning from 2 to 16. During each run, we measured the average values of the workload profile features included in the set P along different observation intervals having a pre-established length, and we used them as input to the three instantiated models in order to compute the expected abort probability for each observation interval. Afterwards, for

Observed concurrency levels for the regression analysis:

application | 2/4 threads       | 2/4/8 threads     | 2/4/8/16 threads
Vacation    | 2.166% (0.00089)  | 1.323% (0.00028)  | 1.505% (0.00032)
Kmeans      | 18.938% (0.09961) | 2.086% (0.00100)  | 2.591% (0.00109)
Yada        | 2.385% (0.00029)  | 2.086% (0.00016)  | 2.083% (0.00022)

Table I: Abort probability prediction error (and its variance)

each instantiated model, we compared the predicted value with the real one observed during the runs.

In Table I, we report the average value of the prediction error (and its variance) for all the target benchmark applications, and for the three model instances, while considering variations of the actual level of concurrency between 2 and 16. By the results, we note that, for the cases of Yada and Vacation, it has been sufficient to execute the regression analysis with samples gathered by observing the application running with only 2 and 4 threads in order to achieve an average prediction error bounded by 2.4% for any level of concurrency between 2 and 16. When enlarging the observation domain for the gathering of regression samples, i.e., when also observing the application running with 8 concurrent threads, we achieved a slight error reduction for Yada. With Vacation, the reduction is more accentuated. On the other hand, the prediction error achieved for Kmeans with observations of the application running with 2 and 4 concurrent threads was greater. However, such an error drastically drops when including samples gathered with 8 concurrent threads in the data set for regression. As for regression based on samples gathered with 2, 4, 8 and 16 threads, we note that the error marginally increases in all the cases. We believe that this is due to the high variance of the values of the transaction abort probability we measured for executions with 16 concurrent threads, which gives rise to variability of the results of the regression analysis depending on the set of used observations. Overall, the results show the good accuracy and effectiveness of the model, since it can provide a low prediction error for a relatively wide range of hypothesized thread concurrency levels (namely between 2 and 16) by just relying on observations of the application running with 2, 4 and (at worst also) 8 concurrent threads.

We conclude this section by comparing the extrapolation ability of our model with that of the neural network-based model proposed in [14], which, similarly to ours, has been targeted at the estimation of STM performance (vs the level of concurrency). To perform a fair comparison, the same set of observations has been provided as input to both models. Particularly, the reported results refer to the Yada benchmark application, for which we provided a set of 80 observations (the same used for validating the model, as shown above), related to executions with 2 and 4 concurrent threads, as input to both our parametric model and the neural network-based model in [14]. As for the neural network approach, we used a back-propagation algorithm


[15], and we selected the best trained network, in terms of prediction accuracy, among a set of networks having a number of hidden nodes spanning from 2 to 16, using a number of algorithm iterations spanning from 50 to 1600. In Figure 11, we show two dispersion charts, each one representing the correlation between the measured values of the transaction abort probability and the ones predicted using the model (left chart) and the neural network (right chart). These refer to concurrency levels spanning the whole interval 2-16. We remark that a lower prediction error corresponds to a higher concentration of points along the diagonal straight line evidenced in the graphs. We can see that, in the case of the neural network, there is a significantly wider dispersion of points compared to the model we are proposing. In fact, the average prediction error for the neural network is equal to 17.3%, while for the model it is equal to 2.385%. This is a clear indication of the higher ability of the model to extrapolate the abort probability when targeting concurrency levels for which no real execution sample is available (and/or that are far from the concurrency levels for which sampling has been actuated). As a reflection, the parametric model we present provides highly reliable estimations, even with few profiling data available for the instantiation of its parameters. Hence, it is suited for the construction of concurrency regulation systems inducing low overhead and providing timely selection of the best suited parallelism configuration (just because the model needs a few samples related to a limited set of configurations in order to deliver its reliable prediction on the optimal concurrency level to be adopted). A concurrency self-regulation architecture exploiting the parametric model is presented and experimentally assessed in the next section.

IV. CONCURRENCY SELF-REGULATING STM

A. The Architecture

The architecture of the Concurrency Self-Regulating STM (CSR-STM) is depicted in Figure 12. A Statistics Collector (SC) provides a Control Algorithm (CA) with the average values of the workload profile parameters, i.e., rss, wss, rwa, wwa, tt and tntc, measured by observing the application on a periodic basis. The CA then exploits these values to calculate, through the parametric model, the transaction abort probability pa,k predicted when using k concurrent threads, for each k such that 1 ≤ k ≤ maxthread. The value maxthread represents the maximum number of concurrent threads admitted for executing the application. We remark that a number of concurrent threads larger than the number of available CPU-cores typically penalizes STM performance (e.g., due to costs related to context switches among the threads). Hence, it is generally convenient to bound maxthread by the number of available CPU-cores. The set {pa,k : 1 ≤ k ≤ maxthread} of predictions is used by the CA to estimate the number m of concurrent threads

which is expected to maximize the application throughput. Particularly, m is identified as the value of k for which

k / (wt,k + tt,k + tntc,k) (11)

is maximized. In the above expression: wt,k is the average transaction wasted time (i.e., the average execution time spent on all the aborted runs of a transaction); tt,k is the average execution time of committed transaction runs; tntc,k is the average execution time of ntc blocks. All these parameters refer to the scenario where the application runs with k concurrent threads.

We note that wt,k + tt,k + tntc,k is the average execution time between the commit operations of two consecutive transactions executed by the same thread when there are k active threads. Hence, expression (11) represents the system throughput. We now discuss how wt,k, tt,k and tntc,k are estimated. The term wt,k can be evaluated by multiplying the average number of aborted runs of a transaction by the average duration of an aborted transaction run when the application is executed with k concurrent threads. In its turn, the average number of aborted transaction runs with k concurrent threads can be estimated as pa,k/(1 − pa,k), where pa,k is calculated through the presented model.
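The wasted-time estimate just described can be sketched as follows (the numeric values are hypothetical, purely for illustration):

```python
def wasted_time(p_abort, avg_aborted_run_duration):
    # Expected number of aborted runs preceding a commit is
    # p_a,k / (1 - p_a,k); multiplying by the average duration of an
    # aborted run yields the per-transaction wasted time wt_k.
    return p_abort / (1.0 - p_abort) * avg_aborted_run_duration

# Example: with p_a,k = 0.2, each commit is preceded on average by
# 0.25 aborted runs (hypothetical 50 us per aborted run).
wt = wasted_time(0.2, 50e-6)
```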

To calculate the average duration of an aborted transaction run, and to estimate tt,k and tntc,k while varying k, a hardware scalability model has to be used. In the presented version of CSR-STM, we exploited the model proposed in [12], where the function modeling hardware scalability is

C(k) = 1 + p · (k − 1) + q · k · (k − 1) (12)

where p and q are fitting parameters, and C(k) is the scaling factor when the application runs with k concurrent threads. The values of p and q are again calculated through regression analysis. Thus, assuming that, e.g., during the last observation interval there were x concurrent threads and the measured average transaction execution time was tt,x, the CA can calculate tt,k for each value of k through the formula tt,k = C(k)/C(x) · tt,x.

Once the number m of concurrent threads which is expected to maximize the application throughput has been estimated, exactly m threads are kept active by the CA during the subsequent workload sampling interval.
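The overall CA decision can be sketched by composing expressions (11) and (12). This is an illustration under our own assumptions: pa_of_k stands for the abort probabilities predicted by the parametric model, and all numeric values are hypothetical:

```python
def scaling(k, p, q):
    # Expression (12): hardware scalability factor C(k).
    return 1.0 + p * (k - 1) + q * k * (k - 1)

def best_thread_count(pa_of_k, tt_x, tntc_x, ta_x, x, p, q, max_threads):
    """Pick m maximizing expression (11): k / (wt_k + tt_k + tntc_k).

    pa_of_k: dict mapping k -> predicted abort probability p_a,k.
    tt_x, tntc_x, ta_x: average durations of a committed run, an ntc
    block and an aborted run, as measured with x threads."""
    best_k, best_throughput = 1, 0.0
    for k in range(1, max_threads + 1):
        # Rescale the durations measured with x threads to k threads.
        scale = scaling(k, p, q) / scaling(x, p, q)
        tt_k, tntc_k, ta_k = tt_x * scale, tntc_x * scale, ta_x * scale
        pa = pa_of_k[k]
        wt_k = pa / (1.0 - pa) * ta_k          # average wasted time
        throughput = k / (wt_k + tt_k + tntc_k)
        if throughput > best_throughput:
            best_k, best_throughput = k, throughput
    return best_k

# Hypothetical predictions: abort probability grows with k.
pa = {k: min(0.9, 0.05 * k) for k in range(1, 17)}
m = best_thread_count(pa, tt_x=40e-6, tntc_x=10e-6, ta_x=30e-6,
                      x=4, p=0.02, q=0.001, max_threads=16)
```

With these numbers the throughput curve rises, peaks at an intermediate concurrency level, and then degrades as aborts and hardware contention grow, mirroring the thrashing behavior discussed in the introduction.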

B. Evaluation Study

In this section we present an experimental assessment of CSR-STM, based on Vacation, Kmeans and Yada, which have been run on top of the same 16-core HP ProLiant server exploited for the previous experiments. All the tests we present focus on the comparison of the execution times achieved by running the applications on top of CSR-STM and on top of the original version of TinySTM. Specifically, in each test, we measured, for both CSR-STM and TinySTM, the delivered application execution times while varying

Figure 11: Model and neural-network prediction accuracy (dispersion charts of predicted vs measured abort probability).

Figure 12: CSR-STM architecture.

maxthread between 2 and 16. For TinySTM, maxthread corresponds to the (fixed) number of concurrent threads exploited by the application. In the case of CSR-STM, instead, the application starts its execution with a number of concurrent threads equal to maxthread; however, CSR-STM may then change the number of concurrent threads, setting it to any value between 1 and maxthread.

For each application, we calculated the values of the model parameters through regression analysis, using samples gathered by observing the application running with 2 and 4 concurrent threads for the cases of Vacation and Yada, and also including observations with 8 concurrent threads for the case of Kmeans. As for the parameters appearing in the hardware scalability model expressed in (12), the regression analysis has been performed by using, for each application, the measured average execution times of committed transaction runs, observed with 2, 4 and 8 concurrent threads.

We performed a number of runs using, for each application, different values for the input parameters. Due to space constraints, we only report the results achieved with two different workload profiles for each application, which are shown in Figures 13, 14 and 15 for Vacation, Kmeans and Yada, respectively. We explicitly report, according to the input-string syntax established by STAMP, the values of the input parameters used to run the applications.

Observing the results, the advantages of CSR-STM with respect to TinySTM can be easily appreciated. For system configurations where CSR-STM is allowed to use a maximum number of threads (maxthread) greater than the optimal concurrency level (as identified by the peak performance delivered by TinySTM), it always tunes the concurrency level to well suited values. Thus it avoids the performance loss experienced by TinySTM when the number of available CPU-cores exceeds the optimal parallelism level. Particularly, the performance of TinySTM tends to steadily degrade while incrementing the parallelism level. Conversely, CSR-STM prevents this performance loss, providing a performance level which is, in the majority of the cases, close to the best value, independently of the actual number of CPU-cores available for running the application.

Obviously, when maxthread is lower than the optimal concurrency level, CSR-STM cannot activate the well suited number of concurrent threads, which equals the optimal level of parallelism. Thus, for these configurations, the performance of CSR-STM is, in some cases, slightly reduced with respect to TinySTM due to the overhead associated with the components/tasks proper of the concurrency self-regulation mechanism. As for the latter aspect, all the components except SC (for which we measured a negligible overhead) require a single (non-CPU-bound) thread. Thus, the additional resource demand, with respect to the total application demand, is bounded by a factor 1/k (when k CPU-cores are available). Accordingly, in the cases where CSR-STM provides lower performance than TinySTM (e.g., when maxthread is less than 4 for Vacation and Kmeans), the advantage of TinySTM progressively decreases vs maxthread.

V. SUMMARY

In this article we have presented a parametric analytical model for determining the optimal level of concurrency in STM applications. The instantiation of the parameters can be actuated via a lightweight regression process based on a few samples of the run-time behavior of the application. Also, the model does not rely on any strong assumption about the application profile, hence being usable in generic application contexts. It has been validated via comparison with real data traced by running applications from the STAMP benchmark suite on top of a 16-core HP ProLiant machine. We also presented a concurrency self-regulation architecture based on the model, which has been integrated into the open source TinySTM framework, and reported experimental data showing its ability to regulate the concurrency level to well suited values.

REFERENCES

[1] M. Ansari, C. Kotselidis, K. Jarvis, M. Lujan, C. Kirkham, and I. Watson. Advanced concurrency control for transactional memory using transaction commit rate. In Proc. 14th Int. Euro-Par Conference, pages 719–728, 2008.

[2] C. Cao Minh, J. Chung, C. Kozyrakis, and K. Olukotun. STAMP: Stanford transactional applications for multi-processing. In Proc. 4th IEEE Int. Symposium on Workload Characterization, pages 35–46, 2008.

Figure 13: Execution time for Vacation with CSR-STM and TinySTM (inputs: -n4 -q60 -u90 -r32768 -t4194304 -p800000 and -n4 -q90 -u98 -r32768 -t4194304 -p400000).

Figure 14: Execution time for Kmeans with CSR-STM and TinySTM (inputs: -m10 -n10 -t0,0005 -i random-n65536-d32-c16.txt and -m10 -n10 -t0,00005 -i random-n65536-d32-c16.txt).

Figure 15: Execution time for Yada with CSR-STM and TinySTM (inputs: -a20 -i ttimeu100000.2 and -a30 -i ttimeu10000.2).

[3] M. Castro, L. F. W. Goes, C. P. Ribeiro, M. Cole, M. Cintra, and J.-F. Mehaut. A machine learning-based approach for thread mapping on transactional memory applications. In Proc. 18th Int. Conf. on High Performance Computing, pages 1–10, 2011.

[4] P. di Sanzo, B. Ciciani, R. Palmieri, F. Quaglia, and P. Romano. On the analytical modeling of concurrency control algorithms for software transactional memories: The case of commit-time-locking. Performance Evaluation, 69(5):187–205, 2012.

[5] P. di Sanzo, B. Ciciani, F. Quaglia, and P. Romano. A performance model of multi-version concurrency control. In Proc. 16th Int. Symp. on Modeling, Analysis and Simulation of Computer and Telecomm. Systems, pages 41–50, 2008.

[6] P. di Sanzo, R. Palmieri, B. Ciciani, F. Quaglia, and P. Romano. Analytical modeling of lock-based concurrency control with arbitrary transaction data access patterns. In Proc. 26th Int. Conf. on Performance Eng., pages 69–78, 2010.

[7] D. Didona, P. Felber, D. Harmanci, P. Romano, and J. Schenker. Identifying the optimal level of parallelism in transactional memory systems. In Proc. Int. Conf. on Networked Systems, 2013.

[8] A. Dragojevic and R. Guerraoui. Predicting the scalability of an STM: a pragmatic approach. In Proc. 5th ACM Workshop on Transactional Computing, 2010.

[9] A. Dragojevic, R. Guerraoui, A. V. Singh, and V. Singh. Preventing versus curing: avoiding conflicts in transactional memories. In Proc. 28th ACM Symposium on Principles of Distributed Computing, pages 7–16, 2009.

[10] S. Elnikety, S. Dropsho, E. Cecchet, and W. Zwaenepoel. Predicting replicated database scalability from standalone database profiling. In Proc. 4th ACM European Conference on Computer Systems, pages 303–316, 2009.

[11] P. Felber, C. Fetzer, and T. Riegel. Dynamic performance tuning of word-based software transactional memory. In Proc. 13th ACM Symposium on Principles and Practice of Parallel Programming, pages 237–246, 2008.

[12] N. J. Gunther. Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable Applications and Services. Springer, 2007.

[13] Z. He and B. Hong. Modeling the run-time behavior of transactional memory. In Proc. 18th Int. Symp. on Modeling, Analysis and Simulation of Computer and Telecomm. Systems, pages 307–315, 2010.

[14] D. Rughetti, P. di Sanzo, B. Ciciani, and F. Quaglia. Machine learning-based self-adjusting concurrency in software transactional memory systems. In Proc. 20th Int. Symp. on Modeling, Analysis and Simulation of Computer and Telecomm. Systems, pages 278–285, 2012.

[15] S. J. Russell, P. Norvig, J. F. Candy, J. M. Malik, and D. D. Edwards. Artificial Intelligence: A Modern Approach. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1996.

[16] I. K. Ryu and A. Thomasian. Performance analysis of centralized databases with optimistic concurrency control. Performance Evaluation, 7(3):195–211, 1987.

[17] N. Shavit and D. Touitou. Software transactional memory. In Proc. 14th ACM Symposium on Principles of Distributed Computing, pages 204–213, 1995.

[18] Y. C. Tay, N. Goodman, and R. Suri. Locking performance in centralized databases. ACM Transactions on Database Systems, pages 415–462, 1985.

[19] A. Thomasian. Concurrency control: methods, performance, and analysis. ACM Computing Surveys, pages 70–119, 1998.

[20] Q. Wang, S. Kulkarni, J. V. Cavazos, and M. Spear. Towards applying machine learning to adaptive transactional memory. In Proc. 6th ACM Workshop on Transactional Computing, 2011.

[21] R. M. Yoo and H.-H. S. Lee. Adaptive transaction scheduling for transactional memory systems. In Proc. 20th Symp. on Parallelism in Algorithms and Architectures, pages 169–178, 2008.

[22] P. S. Yu, D. M. Dias, and S. S. Lavenberg. On the analytical modeling of database concurrency control. Journal of the ACM, pages 831–872, 1993.