LogCA: A High-Level Performance Model for Hardware ...research.cs.wisc.edu/multifacet/papers/isca17_logca.pdf · LogCA: A High-Level Performance Model for Hardware Accelerators Muhammad

Post on 07-Apr-2018

215 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

Transcript

LogCA A High-Level Performance Model for HardwareAccelerators

Muhammad Shoaib Bin Altaf lowast

AMD ResearchAdvanced Micro Devices Inc

shoaibaltafamdcom

David A WoodComputer Sciences Department

University of Wisconsin-Madisondavidcswiscedu

ABSTRACTWith the end of Dennard scaling architects have increasingly turnedto special-purpose hardware accelerators to improve the performanceand energy efficiency for some applications Unfortunately accel-erators donrsquot always live up to their expectations and may under-perform in some situations Understanding the factors which effectthe performance of an accelerator is crucial for both architects andprogrammers early in the design stage Detailed models can behighly accurate but often require low-level details which are notavailable until late in the design cycle In contrast simple analyticalmodels can provide useful insights by abstracting away low-levelsystem details

In this paper we propose LogCAmdasha high-level performancemodel for hardware accelerators LogCA helps both programmersand architects identify performance bounds and design bottlenecksearly in the design cycle and provide insight into which optimiza-tions may alleviate these bottlenecks We validate our model acrossa variety of kernels ranging from sub-linear to super-linear com-plexities on both on-chip and off-chip accelerators We also describethe utility of LogCA using two retrospective case studies First wediscuss the evolution of interface design in SUNOraclersquos encryptionaccelerators Second we discuss the evolution of memory interfacedesign in three different GPU architectures In both cases we showthat the adopted design optimizations for these machines are similarto LogCArsquos suggested optimizations We argue that architects andprogrammers can use insights from these retrospective studies forimproving future designs

CCS CONCEPTSbull Computing methodologies rarr Modeling methodologies bull Com-puter systems organization rarr Heterogeneous (hybrid) systemsbull Hardware rarr Hardware accelerators

KEYWORDSAnalytical modeling Performance Accelerators Heterogenous ar-chitectures

lowastThis work was done while a PhD student at Wisconsin

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page Copyrights for components of this work owned by others than ACMmust be honored Abstracting with credit is permitted To copy otherwise or republishto post on servers or to redistribute to lists requires prior specific permission andor afee Request permissions from permissionsacmorgISCA rsquo17 June 24-28 2017 Toronto ON Canadacopy 2017 Association for Computing MachineryACM ISBN 978-1-4503-4892-81706 $1500httpsdoiorg10114530798563080216

16 64 256 1K 4K 16

K64

K

0001

001

01

1

10

Break-even point

Offloaded Data (Bytes)

Tim

e(m

s)

Unaccelerated Accelerated

(a) Execution time on UltraSPARC T2

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

Break-even point

Offloaded Data (Bytes)

Sp

eed

up

SPARC T4 UltraSPARC T2 GPU

(b) Variation in speedup for different crypto accelerators

Figure 1 Executing Advanced Encryption Standard (AES)[30]

ACM Reference formatMuhammad Shoaib Bin Altaf and David A Wood 2017 LogCA A High-Level Performance Model for Hardware Accelerators In Proceedings ofISCA rsquo17 Toronto ON Canada June 24-28 2017 14 pageshttpsdoiorg10114530798563080216

1 INTRODUCTIONThe failure of Dennard scaling [12 49] over the last decade hasinspired architects to introduce specialized functional units such asaccelerators [6 36] These accelerators have shown considerableperformance and energy improvement over general-purpose coresfor some applications [14 16 23 25 26 50 51 55] Commercialprocessors already incorporate a variety of accelerators rangingfrom encryption to compression from video streaming to patternmatching and from database query engines to graphics processing[13 37 45]

Unfortunately accelerators do not always live up to their name orpotential Offloading a kernel to an accelerator incurs latency andoverhead that depends on the amount of offloaded data location of

ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

accelerator and its interface with the system In some cases thesefactors may outweigh the potential benefits resulting in lower thanexpected ormdashin the worst casemdashno performance gains Figure 1illustrates such an outcome for the crypto accelerator in UltraSPARCT2 running the Advanced Encryption Standard (AES) kernel [30]

Figure 1 provides two key observations First accelerators canunder-perform as compared to general-purpose core eg the ac-celerated version in UltraSPARC T2 outperforms the unacceleratedone only after crossing a threshold block size ie the break-evenpoint (Figure 1-a) Second different acceleratorsmdashwhile executingthe same kernelmdashhave different break-even points eg SPARC T4breaks even for smaller offloaded data while UltraSPARC T2 andGPU break even for large offloaded data (Figure 1-b)

Understanding the factors which dictate the performance of anaccelerator are crucial for both architects and programmers Pro-grammers need to be able to predict when offloading a kernel will beperformance efficient Similarly architects need to understand howthe acceleratorrsquos interfacemdashand the resulting latency and overheadsto offload a kernelmdashwill affect the achievable accelerator perfor-mance Considering the AES encryption example programmers andarchitects would greatly benefit from understanding What bottle-necks cause UltraSPARC T2 and GPU to under-perform for smalldata sizes Which optimizations on UltraSPARC T2 and GPU resultin similar performance to SPARC T4 Which optimizations are pro-grammer dependent and which are architect dependent What arethe trade-offs in selecting one optimization over the other

To answer these questions programmer and architects can employeither complex or simple modeling techniques Complex modelingtechniques and full-system simulation [8 42] can provide highlyaccurate performance estimates Unfortunately they often requirelow-level system details which are not available till late in the designcycle In contrast analytical modelsmdashsimpler ones in particularmdashabstract away these low-level system details and provide key insightsearly in the design cycle that are useful for experts and non-expertsalike [2 5 19 27 48 54]

For an insightful model for hardware accelerators this paperpresents LogCA LogCA derives its name from five key parameters(Table 1) These parameters characterize the communication latency(L) and overheads (o) of the accelerator interface the granularitysize(g) of the offloaded data the complexity (C) of the computation andthe acceleratorrsquos performance improvement (A) as compared to ageneral-purpose core

LogCA is inspired by LogP [9] the well-known parallel compu-tation model LogP sought to find the right balance between overlysimple models (eg PRAM) and the detailed reality of modern par-allel systems LogCA seeks to strike the same balance for hardwareaccelerators providing sufficient simplicity such that programmersand architects can easily reason with it Just as LogP was not thefirst model of parallel computation LogCA is not the first model forhardware accelerators [28] With LogCA our goal is to develop asimple model that supports the important implications (sect2) of ouranalysis and use as few parameters as possible while providing suf-ficient accuracy In Einsteinrsquos words we want our model to be assimple as possible and no simpler

LogCA helps programmers and architects reason about an accel-erator by abstracting the underlying architecture It provides insights

about the acceleratorrsquos interface by exposing the design bounds andbottlenecks and suggests optimizations to alleviate these bottlenecksThe visually identifiable optimization regions help both experts andnon-experts to quantify the trade-offs in favoring one optimizationover the other While the general trend may not be surprising weargue that LogCA is accurate enough to answer important what-ifquestions very early in the design cycle

We validate our model across on-chip and off-chip acceleratorsfor a diverse set of kernels ranging from sub-linear to super-linearcomplexities We also demonstrate the utility of our model usingtwo retrospective case studies (sect5) In the first case study we con-sider the evolution of interface in the cryptographic accelerator onSunOraclersquos SPARC T-series processors For the second case weconsider the memory interface design in three different GPU ar-chitectures a discrete an integrated and a heterogeneous systemarchitecture (HSA) [38] supported GPU In both case studies weshow that the adopted design optimizations for these machines aresimilar to LogCArsquos suggested optimizations We argue that architectsand programmers can use insights from these retrospective studiesfor improving future designs

This paper makes the following contributions

bull We propose a high-level visual performance model provid-ing insights about the interface of hardware accelerators(sect2)

bull We formalize performance metrics for predicting the ldquorightrdquoamount of offloaded data (sect22)

bull Our model identifies the performance bounds and bottle-necks associated with an accelerator design (sect3)

bull We provide an answer to what-if questions for both pro-grammers and architects at an early design stage (sect3)

bull We define various optimization regions and the potentialgains associated with these regions (sect32)

bull We demonstrate the utility of our model on five differentcryptographic accelerators and three different GPU archi-tectures (sect5)

2 THE LogCA MODELLogCA assumes an abstract system with three components (Figure 2(a)) Host is a general-purpose processor Accelerator is a hardwaredevice designed for the efficient implementation of an algorithmand Interface connects the host and accelerator abstracting awaysystem details including the memory hierarchy

Our model uses the interface abstraction to provide intuition forthe overhead and latency of dispatching work to an accelerator Thisabstraction enables modeling of different paradigms for attachingacceleratorsmdashdirectly connected system bus or PCIe This alsogives the flexibility to use our model for both on-chip and off-chipaccelerators This abstraction can also be trivially mapped to sharedmemory systems or other memory hierarchies in heterogeneousarchitectures The model further abstracts the underlying architectureusing the five parameters defined in Table 1

Figure 2 (b) illustrates the overhead and latency model for anun-pipelined accelerator where computation lsquoirsquo is returned before re-questing computation lsquoi+1rsquo Figure 2 (b) also shows the breakdownof time for an algorithm on the host and accelerator We assume that

LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

Table 1 Description of the LogCA parameters

Parameter Symbol Description Units

Latency L Cycles to move data from the host to the accelerator across the interface including the cycles dataspends in the caches or memory

Cycles

Overhead o Cycles the host spends in setting up the algorithm Cycles

Granularity g Size of the offloaded data Bytes

Computational Index C Cycles the host spends per byte of data CyclesByte

Acceleration A The peak speedup of an accelerator NA

Host Accelerator

Interface

time

Co(g)

o1(g) L1(g)C1(g) =

Co(g)A

Gain

T0(g)

T1(g)

(a) (b)

Figure 2 Top level description of the LogCA model (a) Showsthe various components (b) Time-line for the computation per-formed on the host system (above) and on an accelerator (be-low)

the algorithmrsquos execution time is a function of granularity ie thesize of the offloaded data With this assumption the unacceleratedtime T0 (time with zero accelerators) to process data of granularityg will be T0 (g) =C0 (g) where C0 (g) is the computation time on thehost

When the data is offloaded to an accelerator the new executiontime T1 (time with one accelerator) is T1 (g) =O1 (g)+L1 (g)+C1 (g)where O1 (g) is the host overhead time in offloading lsquogrsquo bytes ofdata to the accelerator L1 (g) is the interface latency and C1 (g) is thecomputation time in the accelerator to process data of granularity g

To make our model more concrete we make several assumptionsWe assume that an accelerator with acceleration lsquoArsquo can decreasein the absence of overheads the algorithmrsquos computation time onthe host by a factor of lsquoArsquo ie the accelerator and host use algo-rithms with the same complexity Thus the computation time on theaccelerator will be C1 (g) =

C0 (g)A This reduction in the computation

time results in performance gains and we quantify these gains withspeedup the ratio of the un-accelerated and accelerated time

Speedup(g) =T0 (g)T1 (g)

=C0 (g)

O1 (g)+L1 (g)+C1 (g)(1)

We assume that the computation time is a function of the com-putational index lsquoCrsquo and granularity ie C0 (g) =C lowast f (g) wheref (g) signifies the complexity of the algorithm We also assume thatf (g) is power function of rsquogrsquo ie O (gβ ) This assumption resultsin a simple closed-form model and bounds the performance for amajority of the prevalent algorithms in the high-performance comput-ing community [4] ranging from sub-linear (β lt 1) to super-linear(β gt 1) complexities However this assumption may not work wellfor logarithmic complexity algorithms ie O (log(g))O (g log(g))This is because asymptotically there is no function which grows

slower than a logarithmic function Despite this limitation we ob-serve thatmdashin the granularity range of our interestmdashLogCA can alsobound the performance for logarithmic functions (sect5)

For many algorithms and accelerators the overhead is indepen-dent of the granularity ie O1 (g) = o Latency on the other handwill often be granularity dependent ie L1 (g) = Llowastg Latency maybe granularity independent if the accelerator can begin operatingwhen the first byte (or block) arrives at the accelerator ie L1 (g) = LThus LogCA can also model pipelined interfaces using granularityindependent latency assumption

We define computational intensity1 as the ratio of computationalindex to latency ie C

L and it signifies the amount of work done ona host per byte of offloaded data Similarly we define acceleratorrsquoscomputational intensity as the ratio of computational intensity toacceleration ie CA

L and it signifies the amount of work done onan accelerator per byte of offloaded data

For simplicity we begin with the assumption of granularity in-dependent latency We revisit granularity dependent latencies later(sect 23) With these assumptions

Speedup(g) =C lowast f (g)

o+L+ Clowast f (g)A

=C lowastgβ

o+L+ Clowastgβ

A

(2)

The above equation shows that the speedup is dependent on LogCAparameters and these parameters can be changed by architects andprogrammers through algorithmic and design choices An architectcan reduce the latency by integrating an accelerator more closelywith the host For example placing it on the processor die ratherthan on an IO bus An architect can also reduce the overheads bydesigning a simpler interface ie limited OS intervention and ad-dress translations lower initialization time and reduced data copyingbetween buffers (memories) etc A programmer can increase thecomputational index by increasing the amount of work per byteoffloaded to an accelerator For example kernel fusion [47 52]mdashwhere multiple computational kernels are fused into onemdashtends toincrease the computational index Finally an architect can typicallyincrease the acceleration by investing more chip resources or powerto an accelerator

21 Effect of GranularityA key aspect of LogCA is that it captures the effect of granularity onthe acceleratorrsquos speedup Figure 3 shows this behavior ie speedupincreases with granularity and is bounded by the acceleration lsquoArsquo At

1not to be confused with operational intensity [54] which signifies operations performedper byte of DRAM traffic

ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

g1 gA2

1

A2

A

Granularity (Bytes)

Sp

eed

up

(g)

Figure 3 A graphical description of the performance metrics

one extreme for large granularities equation (2) becomes

limgrarrinfin

Speedup(g) = A (3)

While for small granularities equation (2) reduces to

limgrarr0

Speedup(g) ≃ Co+L+ C

Alt

Co+L

(4)

Equation (4) is simply Amdahlrsquos Law [2] for accelerators demon-strating the dominating effect of overheads at small granularities

22 Performance MetricsTo help programmers decide when and how much computation tooffload we define two performance metrics These metrics are in-spired by the vector machine metrics Nv and N12[18] where Nvis the vector length to make vector mode faster than scalar modeand N12 is the vector length to achieve half of the peak perfor-mance Since vector length is an important parameter in determiningperformance gains for vector machines these metrics characterizethe behavior and efficiency of vector machines with reference toscalar machines Our metrics tend to serve the same purpose in theaccelerator domain

g1 The granularity to achieve a speedup of 1 (Figure 3) It isthe break-even point where the acceleratorrsquos performance becomesequal to the host Thus it is the minimum granularity at which anaccelerator starts providing benefits Solving equation (2) for g1gives

g1 =

[(A

Aminus1

)lowast(

o+LC

)] 1β

(5)

IMPLICATION 1 g1 is essentially independent of accelerationfor large values of lsquoArsquo

For reducing g1 the above implication guides an architect toinvest resources in improving the interface

IMPLICATION 2 Doubling computational index reduces g1 by

2minus1β

The above implication demonstrates the effect of algorithmiccomplexity on g1 and shows that varying computational index has aprofound effect on g1 for sub-linear algorithms For example for asub-linear algorithm with β = 05 doubling the computational indexdecreases g1 by a factor of four However for linear (β = 1) andquadratic (β = 2) algorithms g1 decreases by factors of two and

radic2

respectively

g A2 The granularity to achieve a speedup of half of the acceler-

ation This metric provides information about a systemrsquos behaviorafter the break-even point and shows how quickly the speedup canramp towards acceleration Solving equation (2) for g A

2gives

g A2=

[Alowast(

o+LC

)] 1β

(6)

Using equation (5) and (6) g1 and g A2are related as

g A2= (Aminus1)

1β lowastg1 (7)

IMPLICATION 3 Doubling acceleration lsquoArsquo increases the gran-

ularity to attain A2 by 2

The above implication demonstrates the effect of accelerationon g A

2and shows that this effect is more pronounced for sub-linear

algorithms For example for a sub-linear algorithm with β = 05doubling acceleration increases g A

2by a factor of four However for

linear and quadratic algorithms g A2increases by factors of two and

radic2 respectivelyFor architects equation (7) also exposes an interesting design

trade-off between acceleration and performance metrics Typicallyan architect may prefer higher acceleration and lower g1 g A

2 How-

ever equation (7) shows that increasing acceleration also increasesg A

2 This presents a dilemma for an architect to favor either higher

acceleration or reduced granularity especially for sub-linear algo-rithms LogCA helps by exposing these trade-offs at an early designstage

In our model we also use g1 to determine the complexity of thesystemrsquos interface A lower g1 (on the left side of plot in Figure 3)is desirable as it implies a system with lower overheads and thus asimpler interface Likewise g1 increases with the complexity of theinterface or when an accelerator moves further away from the host

23 Granularity dependent latencyThe previous section assumed latency is granularity independent butwe have observed granularity dependent latencies in GPUs In thissection we discuss the effect of granularity on speedup and deriveperformance metrics assuming granularity dependent-latency

Assuming granularity dependent latency equation (1) reduces to

Speedup(g) =C lowastgβ

o+Llowastg+ Clowastgβ

A

(8)

For large granularities equation (8) reduces to

limgrarrinfin

Speedup(g) =

(A

AClowastgβ

lowast (Llowastg)+1

)lt

CLlowastgβminus1 (9)

Unlike equation (3) speedup in the above equation approachesCL lowastgβminus1 at large granularities Thus for linear algorithms with gran-ularity dependent latency instead of acceleration speedup is limitedby C

L However for super-linear algorithms this limit increases by afactor of gβminus1 whereas for sub-linear algorithms this limit decreasesby a factor of gβminus1

LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

IMPLICATION 4 With granularity dependent latency the speedupfor sub-linear algorithms asymptotically decreases with the increasein granularity

The above implication suggests that for sub-linear algorithms onsystems with granularity dependent latency speedup may decreasefor some large granularities This happens because for large granu-larities the communication latency (a linear function of granularity)may be higher than the computation time (a sub-linear function ofgranularity) on the accelerator resulting in a net de-accelerationThis implication is surprising as earlier we observed thatmdashfor sys-tems with granularity independent latencymdashspeedup for all algo-rithms increase with granularity and approaches acceleration forvery large granularities

For very small granularities equation (8) reduces to

limgrarr 0

Speedup(g) ≃ Alowast CAlowast (o+L)+C

(10)

Similar to equation (4) the above equation exposes the increasingeffects of overheads at small granularities Solving equation (8) forg1 using Newtonrsquos method [53]

g1 =C lowast (β minus1) lowast (Aminus1)+Alowasto

C lowastβ lowast (Aminus1)minusAlowastL(11)

For a positive value of g1 equation (11) must satisfy CL gt 1

β

Thus for achieving any speedup for linear algorithms CL should

be at least 1 However for super-linear algorithms a speedup of 1can achieved at values of C

L smaller than 1 whereas for sub-linearalgorithms algorithms C

L must be greater than 1

IMPLICATION 5 With granularity dependent latency computa-tional intensity for sub-linear algorithms should be greater than 1to achieve any gains

Thus for sub-linear algorithms computational index has to begreater than latency to justify offloading the work However forhigher-complexity algorithms computational index can be quitesmall and still be potentially useful to offload

Similarly solving equation (8) using Newtonrsquos method for g A2

gives

g A2=

C lowast (β minus1)+AlowastoC lowastβ minusAlowastL

(12)

For a positive value of g A2 equation (12) must satisfy CA

L gt 1β

Thus for achieving a speedup of A2 CL should be at least lsquoArsquo for

linear algorithms However for super-linear algorithms a speedupof A

2 can achieved at values of CL smaller than lsquoArsquo whereas for

sub-linear algorithms CL must be greater than lsquoArsquo

IMPLICATION 6 With granularity dependent latency accelera-torrsquos computational intensity for sub-linear algorithms should begreater than 1 to achieve speedup of half of the acceleration

The above implication suggests that for achieving half of theacceleration with sub-linear algorithms the computation time on theaccelerator must be greater than latency However for super-linearalgorithms that speedup can be achieved even if the computationtime on accelerator is lower than latency Programmers can usethe above implications to determinemdashearly in the design cyclemdashwhether to put time and effort in porting a code to an accelerator

g1

1

A

CL

limgrarrinfin Speedup(g) = A

CL gt A

Sp

eed

up

g1

1

CL

A

limgrarrinfin Speedup(g) = A

CL lt A

g1

1

CL

A

limgrarrinfin Speedup(g) = CL

CL lt A

Granularity (Bytes)

Sp

eed

up

g1

1

CL

A

limgrarrinfin Speedup(g) lt CL

CL lt A

Granularity (Bytes)

(a) Performance bounds for compute-bound kernels

(b) Performance bounds for latency-bound kernels

Figure 4 LogCA helps in visually identifying (a) compute and(b) latency bound kernels

For example consider a system with a minimum desirable speedupof one half of the acceleration but has a computational intensity ofless than the acceleration With the above implication architectsand programmers can infer early in the design stage that the desiredspeedup can not be achieved for sub-linear and linear algorithmsHowever the desired speedup can be achieved with super-linearalgorithms

We are also interested in quantifying the limits on achievablespeedup due to overheads and latencies To do this we assume ahypothetical accelerator with infinite acceleration and calculate thegranularity (gA) to achieve the peak speedup of lsquoArsquo With this as-sumption the desired speedup of lsquoArsquo is only limited by the overheadsand latencies Solving equation (8) for gA gives

gA =C lowast (β minus1)+Alowasto

C lowastβ minusAlowastL(13)

Surprisingly we find that the above equation is similar to equa-tion (12) ie gA equals g A

2 This observation shows that with a

hypothetical accelerator the peak speedup can now be achieved atthe same granularity as g A

2 This observation also demonstrates that

if g A2is not achievable on a system ie CA

L lt 1β

as per equation(12) then despite increasing the acceleration gA will not be achiev-able and the speedup will still be bounded by the computationalintensity

IMPLICATION 7 If a speedup of A2 is not achievable on an ac-

celerator with acceleration lsquoArsquo despite increasing acceleration toAtilde (where Atilde gt A) the speedup is bounded by the computationalintensity

The above implication helps architects in allocating more re-sources for an efficient interface instead of increasing acceleration

ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

No variation

Granularity (Bytes)

Sp

eed

up

(a) Latency

LogCAL110x

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

Granularity (Bytes)

(b) Overheads

LogCAo110x

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

Granularity (Bytes)

(c) Computational Index

LogCAC10x

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

CL

Granularity (Bytes)

(d) Acceleration

LogCAA10x

Figure 5 The effect on speedup of 10x improvement in each LogCA parameter The base case is the speedup of AES [30] on Ultra-SPARC T2

3 APPLICATIONS OF LogCAIn this section we describe the utility of LogCA for visually iden-tifying the performance bounds design bottlenecks and possibleoptimizations to alleviate these bottlenecks

31 Performance BoundsEarlier we have observed that the speedup is bounded by eitheracceleration (equation 3) or the product of computational intensityand gβminus1 (equation 9) Using these observations we classify kernelseither as compute-bound or latency-bound For compute-bound ker-nels the achievable speedup is bounded by acceleration whereas forthe latency-bound kernels the speedup is bounded by computationalintensity Based on this classification a compute-bound kernel caneither be running on a system with granularity independent latencyor has super-linear complexity while running on a system with gran-ularity dependent latency Figure 4-a illustrates these bounds forcompute-bound kernels On the other hand a latency-bound kernelis running on a system with granularity dependent latency and haseither linear or sub-linear complexity Figure 4-b illustrates thesebounds for latency-bound kernels

Programmers and architects can visually identify these boundsand use this information to invest their time and resources in the rightdirection For example for compute-bound kernelsmdashdependingon the operating granularitymdashit may be beneficial to invest moreresources in either increasing acceleration or reducing overheadsHowever for latency-bound kernels optimizing acceleration andoverheads is not that critical but decreasing latency and increasingcomputational index maybe more beneficial

32 Sensitivity AnalysisTo identify the design bottlenecks we perform a sensitivity analysisof the LogCA parameters We consider a parameter a design bottle-neck if a 10x improvement in it provides at lest 20 improvement inspeedup A lsquobottleneckedrsquo parameter also provides an optimizationopportunity To visually identify these bottlenecks we introduceoptimization regions As an example we identify design bottlenecksin UltraSPARC T2rsquos crypto accelerator by varying its individualparameters 2 in Figure 5 (a)-(d)

2We elaborate our methodology for measuring LogCA parameters later (sect 4)

Figure 5 (a) shows the variation (or the lack of) in speedup withthe decrease in latency The resulting gains are negligible and inde-pendent of the granularity as it is a closely coupled accelerator

Figure 5 (b) shows the resulting speedup after reducing overheadsSince the overheads are one-time initialization cost and independentof granularity the per byte setup cost is high at small granularitiesDecreasing these overheads considerably reduces the per byte setupcost and results in significant gains at these smaller granularitiesConversely for larger granularities the per byte setup cost is alreadyamortized so reducing overheads does not provide much gainsThus overhead is a bottleneck at small granularities and provide anopportunity for optimization

Figure 5 (c) shows the effect of increasing the computationalindex The results are similar to optimizing overheads in Figure 5 (b)ie significant gains for small granularities and a gradual decreasein the gains with increasing granularity With the constant overheadsincreasing computational index increases the computation time of thekernel and decreases the per byte setup cost For smaller granularitiesthe reduced per byte setup cost results in significant gains

Figure 5 (d) shows the variation in speedup with increasing peakacceleration The gains are negligible at small granularities andbecome significant for large granularities As mentioned earlierthe per byte setup cost is high at small granularities and it reducesfor large granularities Since increasing peak acceleration does notreduce the per byte setup cost optimizing peak acceleration providesgains only at large granularities

We group these individual sensitivity plots in Figure 6 to buildthe optimization regions As mentioned earlier each region indicatesthe potential of 20 gains with 10x variation of one or more LogCAparameters For the ease of understanding we color these regionsand label them with their respective LogCA parameters For exam-ple the blue colored region labelled lsquooCrsquo (16B to 2KB) indicatesan optimization region where optimizing overheads and computa-tional index is beneficial Similarly the red colored region labelledlsquoArsquo (32KB to 32MB) represents an optimization region where opti-mizing peak acceleration is only beneficial The granularity rangeoccupied by a parameter also identifies the scope of optimizationfor an architect and a programmer For example for UltraSPARCT2 overheads occupy most of the lower granularity suggesting op-portunity for improving the interface Similarly the absence of thelatency parameter suggests little benefits for optimizing latency

We also add horizontal arrows to the optimization regions inFigure 6 to demarcate the start and end of granularity range for each

LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

Table 2 Description of the Cryptographic accelerators

Crypto Accelerator PCI Crypto UltraSPARC T2 SPARC T3 SPARC T4 Sandy BridgeProcessor AMD A8-3850 S2 S2 S3 Intel Core i7-2600Frequency 29 GHz 116 GHz 165 GHz 3 GHz 34 GHzOpenSSL version 098o 098o 098o 102 101k 098oKernel Ubuntu 3130-55 Oracle Solaris 11 Oracle Solaris 11 Oracle Solaris 112 Linux2632-504

16 128 g1 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

CL

oC AoCA

oC

A

Granularity (Bytes)

Sp

eed

up

LogCA L110xo110x C10x A10x

Figure 6 Optimization regions for UltraSPARC T2 The pres-ence of a parameter in an optimization region indicates thatit can at least provides 20 gains The horizontal arrow in-dicates the cut-off granularity at which a parameter provides20 gains

parameter For example optimizing acceleration starts providingbenefits from 2KB while optimizing overheads or computationalindex is beneficial up till 32KB These arrows also indicate thecut-off granularity for each parameter These cut-off granularitiesprovide insights to architects and programmers about the designbottlenecks For example high cut-off granularity of 32KB suggestshigh overheads and thus a potential for optimization

4 EXPERIMENTAL METHODOLOGYThis section describes the experimental setup and benchmarks forvalidating LogCA on real machines We also discuss our methodol-ogy for measuring LogCA parameters and performance metrics

Our experimental setup comprises of on-chip and off-chip cryptoaccelerators (Table 2) and three different GPUs (Table 3) The on-chip crypto accelerators include cryptographic units on SunOracleUltraSPARC T2 [40] SPARC T3 [35] SPARC T4 [41] and AES-NI(AES New Instruction) [15] on Sandy Bridge whereas the off-chipaccelerator is a Hifn 7955 chip connected through the PCIe bus [43]The GPUs include a discrete NVIDIA GPU an integrated AMDGPU (APU) and HSA supported integrated GPU

For the on-chip crypto accelerators each core in UltraSPARC T2and SPARC T3 has a physically addressed crypto unit which requiresprivileged DMA calls However the crypto unit on SPARC T4 isintegrated within the pipeline and does not require privileged DMAcalls SPARC T4 also provides non-privileged crypto instructions toaccess the crypto unit Similar to SPARC T4 sandy bridge providesnon-privileged crypto instructionmdashAESNI

Considering the GPUs the discrete GPU is connected throughthe PCIe bus whereas for the APU the GPU is co-located with thehost processor on the same die For the APU the system memoryis partitioned between host and GPU memory This eliminates thePCIe bottleneck of data copying but it still requires copying databetween memories Unlike discrete GPU and APU HSA supportedGPU provides a unified and coherent view of the system memoryWith the host and GPU share the same virtual address space explicitcopying of data between memories is not required

Our workloads consist of encryption hashing and GPU kernelsFor encryption and hashing we have used advanced encryptionstandard (AES) [30] and standard hashing algorithm (SHA) [31]respectively from OpenSSL [34]mdashan open source cryptography li-brary For GPU kernels we use matrix multiplication radix sortFFT and binary search from AMD OpenCL SDK [1] Table 4 we listthe complexities of each kernel both in terms of number of elementsn and granularity g We expect these complexities to remain same inboth cases but we observe that they differ for matrix multiplicationFor example for a square matrix of size n matrix multiplication hascomplexity of O (n3) whereas the complexity in terms of granularityis O (g17) This happens because for matrix multiplicationmdashunlikeothersmdashcomputations are performed on matrices and not vectorsSo offloading a square matrix of size n corresponds to offloading n2

elements which results in the apparent discrepancy in the complexi-ties We also observe that for the granularity range of 16B to 32MBβ = 011 provides a close approximation for log(g)

Table 3 Description of the GPUs

Platform Discrete GPU Integrated APU AMD HSAName Tesla C2070 Radeon HD 6550 Radeon R7Architecture Fermi Beaver Creek KaveriCores 16 5 8Compute Units 448 400 512Clock Freq 15 GHz 600 MHz 720 MHzPeak FLOPS 1 T 480 G 856 GHostProcessor Intel AMD AMD

Xeon E5520 A8-3850 A10-7850KFrequency GHz 227 29 17

For calculating execution times we have used Linux utilities onthe crypto accelerators whereas for the GPUs we have used NVIDIAand AMD OpenCL profilers to compute the setup kernel and datatransfer times and we report the average of one hundred executionsFor verifying the usage of crypto accelerators we use built-in coun-ters in UltraSPARC T2 and T3 [46] SPARC T4 however no longer

ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

Table 4 Algorithmic complexity of various kernels with num-ber of elements and granularity The power of g represents β

for each kernel

Kernel Algorithmic ComplexityAdvanced Encryption Standard (AES) O (n) O (g101)Secure Hashing Algorithm (SHA) O (n) O (g097)Matrix Multiplication (GEMM) O (n3) O (g17)Fast Fourier Transform (FFT) O (n logn) O (g12)Radix Sort O (kn) O (g094)Binary Search O (logn) O (g014)

Table 5 Calculated values of LogCA Parameters

LogCA ParametersDevice Benchmark L o C A

(cycles) (cycles) (cyclesB)

Discrete GPU

AES 174Radix Sort 290GEMM 3times103 2times108 2 30FFT 290Binary Search 116

APU

AES 174Radix Sort 290GEMM 15 4times108 2 7FFT 290Binary Search 116

UltraSPARC T2 AES 1500 29times104 90 19SHA 105times103 72 12

SPARC T3 AES 1500 27times104 90 12SHA 105times103 72 10

SPARC T4 AES 500 435 32 12SHA 16times103 32 10

SPARC T4 instr AES 4 111 32 12SHA 1638 32 10

Sandy Bridge AES 3 10 35 6

supports these counters so we use Linux utilities to trace the execu-tion of the crypto instructions [3] We use these execution times todetermine LogCA parameters We calculate these parameters onceand can be later used for different kernels on the same system

For computational index and β we profile the CPU code on thehost by varying the granularity from 16B to 32MB At each granu-larity we measure the execution time and use regression analysisto determine C and β For overheads we use the observation thatfor very small granularities the execution time for a kernel on anaccelerator is dominated by the overheads ie limgrarr0 T1 (g) ≃ oFor acceleration we use different methods for the on-chip accelera-tors and GPUs For on-chip accelerators we calculate accelerationusing equation (3) and the observation that the speedup curve flat-tens out and approaches acceleration for very large granularitiesHowever for the GPUs we do not use equation (3) as it requirescomputing acceleration for each kernel as each application has adifferent access pattern which affects the speedup So we boundthe maximum performance using the peak flops from the devicespecifications We use the ratio of peak GFLOPs on CPU and GPUie A = Peak GFLOPGPU

Peak GFLOPCPU Similar to acceleration we use two different

techniques for calculating latency For the on-chip accelerators we

run micro-benchmarks and use execution time on host and acceler-ators On the other hand for the GPUs we compute latency usingpeak memory bandwidth of the GPU Similar to Meswani et al [29]we use the following equation for measuring data copying time forthe GPUs L = 1

BWpeak

Earlier we develop our model using assumptions of granularityindependent and dependent latencies In our setup we observe thatthe on-chip crypto accelerators and HSA-enabled GPU representaccelerators with granularity independent latency while the off-chipcrypto accelerator and discrete GPUAPU represent the granular-ity dependent accelerators For each accelerator we calculate thespeedup and performance metrics using the respective equations(sect2)

5 EVALUATIONIn this section we show that LogCA closely captures the behavior forboth off and on-chip accelerators We also list the calculate LogCAparameters in Table 5 To demonstrate the utility of our modelwe also present two case studies In these studies we consider theevolution of interface in SUNOraclersquos crypto accelerators and threedifferent GPU architectures In both cases we elaborate the designchanges using the insights LogCA provides

51 Linear-Complexity Kernels (β = 1)Figure 7 shows the curve-fitting of LogCA for AES We considerboth off-chip and on-chip accelerators connected through differentinterfaces ranging from PCIe bus to special instructions We observethat the off-chip accelerators and APU unlike on-chip acceleratorsprovide reasonable speedup only at very large granularities We alsoobserve that the achievable speedup is limited by computationalintensity for off-chip accelerators and acceleration for on-chip accel-erators This observation supports earlier implication on the limitsof speedup for granularity independent and dependent latencies inequation (3) and (9) respectively

Figure 7 also shows that UltraSPARC T2 provides higher speedupsthan Sandy Bridge but it breaks-even at a larger granularity SandyBridge on the other hand breaks-even at very small granularitybut provides limited speedup The discrete GPU with powerful pro-cessing cores has the highest acceleration among others Howeverits observed speedup is less than others due to high overheads andlatencies involved in communicating through the PCIe bus

We have also marked g1 and g A2for each accelerator in Figure 7

which help programmers and architects identify the complexity ofthe interface For example g1 for crypto instructions ie SPARCT4 and Sandy Bridge lies on the extreme left while for the off-chipaccelerators g1 lies on the far right It is worth mentioning that wehave marked g a

2for on-chip accelerators but not for the off-chip

accelerators For off-chip accelerators computational intensity isless than acceleration and as we have noted in equation (12) thatg A

2for these designs does not existWe also observe that g1 for the crypto-card connected through

the PCIe bus does not exist showing that this accelerator does notbreak-even even for large granularities Figure 7 also shows thatg1 for GPU and APU is comparable This observation shows thatdespite being an integrated GPU and not connected to the PCIe bus

LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

CL

Sp

eed

up

(a) PCIe crypto

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100A

g1

CL

(b) NVIDIA Discrete GPU

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

(c) AMD Integrated GPU (APU)

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

(d) UltraSPARC T2

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

Granularity (Bytes)

Sp

eed

up

(e) SPARC T3

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

Granularity (Bytes)

(f) SPARC T4 engine

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

gA2

CL

g1 lt 16B

Granularity (Bytes)

(g) SPARC T4 instruction

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

CL

g1 gA2lt 16B

Granularity (Bytes)

(h) AESNI on Sandy Bridge

observed LogCA

Figure 7 Speedup curve fittings plots comparing LogCA with the observed values of AES [30]

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

Granularity (Bytes)

Sp

eed

up

(a) UltraSPARC T2 engine

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

Granularity (Bytes)

(b) SPARC T3 engine

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

Granularity (Bytes)

(c) SPARC T4 engine

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

gA2

g1 lt 16B

CL

Granularity (Bytes)

(d) SPARC T4 instruction

observed LogCA

Figure 8 Speedup curve fittings plots comparing LogCA with the observed values of SHA256 [31] LogCA starts following observedvalues after 64B

APU spends considerable time in copying data from the host todevice memory

Figure 8 shows the curve fitting for SHA on various on-chipcrypto accelerators We observe that g1 and g A

2do exist as all of

these are on-chip accelerators We also observe that the LogCAcurve mostly follows the observed value However it deviates fromthe observed value before 64B This happens because SHA requiresblock size of 64B for hash computation If the block size is less than64B it pads extra bits to make the block size 64B Since LogCAdoes not capture this effect it does not follow the observed speedupfor granularity smaller than 64B

Figure 9-a shows the speedup curve fitting plots for Radix sortWe observe that LogCA does not follow observed values for smallergranularities on GPU Despite this inaccuracy LogCA accuratelypredicts g1 and g A

2 We also observe that g A

2for GPU is higher than

APU and this observation supports equation (7) that increasingacceleration increases g A

2

52 Super-Linear Complexity Kernels (β gt 1)Figures 9-b and 9-c show the speedup curve fitting plots for super-complexity kernels on discrete GPU and APU We observe that ma-trix multiplication with higher complexity (O (g17)) achieves higherspeedup than sort and FFT with lower complexities of O (g) andO (g12) respectively This observation corroborates results fromequation (9) that achievable speedup of higher-complexity algo-rithms is higher than lower-complexity algorithms We also observethat g A

2does not exist for FFT This happens because as we note in

equation (12) that for g A2to exist for FFT C

L should be greater thanA

12 However Figure 9-c shows that CL is smaller than A

12 for bothGPU and APU

53 Sub-Linear Complexity Kernels (β lt 1)Figure 9-d shows the curve fitting for binary search which is asub-linear algorithm (β = 014) We make three observations Firstg1 does not exist even for very large granularities and C

L lt 1 Thisobservation supports implication (5) that for a sub-linear algorithm

ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

Sp

eed

up

GPU

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

APU

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

GPU

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

APU

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100A

g1

CL

Granularity (Bytes)

Sp

eed

up

GPU

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1

CL

Granularity (Bytes)

APU

16 128 1K 8K 64

K51

2K 4M 32M

001

01

1

10

100A

g1 gA2

CL

Granularity (Bytes)

GPU

16 128 1K 8K 64

K51

2K 4M 32M

001

01

1

10

100

A

g1 gA2

CL

Granularity (Bytes)

APU

(a) Radix Sort (b) Matrix Multiplication

(c) FFT (d) Binary Search

observed LogCA

Figure 9 Speedup curve fittings plots comparing LogCA with the observed values of (a) Radix Sort (b) Matrix Multiplication (c) FFTand (d) Binary Search

of β = 014 CL should be greater than 7 to provide any speedup

Second for large granularities speedup starts decreasing with anincrease in granularity This observation supports our earlier claimin implication (4) that for systems with granularity dependent la-tencies speedup for sub-linear algorithms asymptotically decreasesThird LogCA deviates from the observed value at large granularitiesThis deviation occurs because LogCA does not model caches Asmentioned earlier LogCA abstracts the caches and memories witha single parameter of latency which does not capture the memory-access pattern accurately Even though LogCA does not accuratelycaptures binary search behavior it still provides an upper bound onthe achievable performance

54 Case StudiesFigure 10 shows the evolution of crypto accelerators in SPARCarchitectures from the off-chip accelerators in pre-Niagara (Figure 10(a)) to accelerators integrated within the pipeline in SPARC T4(Figure 10 (e)) We observe that latency is absent in the on-chipacceleratorsrsquo optimization regions as these accelerators are closelycoupled with the host We also note that the optimization regionwith overheadsmdashrepresenting the complexity of an acceleratorrsquosinterfacemdashshrinks while the optimization regions with accelerationexpand from Figure 10 (a-e) For example for the off-chip cryptoaccelerator the cut-off granularity for overheads is 256KB whereasit is 128B for the SPARC T4 suggesting a much simpler interface

Figure 10 (a) shows the optimization regions for the off-chipcrypto accelerator connected through the PCIe bus We note thatoverheads and latencies occupy most of the optimization regionsindicating high overhead OS calls and high-latency data copyingover the PCIe bus as the bottlenecks

Figure 10 (b) shows the optimization regions for UltraSPARCT2 The large cut-off granularity for overheads at 32KB suggestsa complex interface indicating high overhead OS call creating abottleneck at small granularities The cut-off granularity of 2KB foracceleration suggests that optimizing acceleration is beneficial atlarge granularities

Figure 10 (d) shows optimization regions for on-chip acceleratoron SPARC T4 There are three optimization regions with the cut-offgranularity for overhead now reduced to only 512B This observationsuggests a considerable improvement in the interface design overSPARC T3 and it is also evident by a smaller g1 We also note thatcut-off granularity for acceleration now decreases to 32B showingan increase in the opportunity for optimizing acceleration

Figure 10 (e) shows optimization regions for crypto instructionson SPARC T4 We observe that unlike earlier designs it has only twooptimization regions and the speedup approaches the peak accelera-tion at a small granularity of 128B In contrast UltraSPARC T2 andSPARC T3 do not even provide any gains at this granularity We alsoobserve that the cut-off granularity for overheads further reduces to128B suggesting some opportunity for optimization at very smallgranularities The model also shows that the acceleration occupiesthe maximum range for optimization For example optimizing accel-eration provides benefits for granularities greater than 16B The lowoverhead access which LogCA shows is due to the non-privilegedinstruction SPARC T4 uses to access the cryptographic unit whichis integrated within the pipeline

Figure 11 shows the evolution of memory interface design inGPU architectures It shows the optimization regions for matrixmultiplication on a discrete NVIDIA GPU an AMD integrated GPU(APU) and an integrated AMD GPU with HSA support We observethat matrix multiplication for all three architectures is compute bound

LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

CL

oCLoC

LC

oL

Granularity (Bytes)

Sp

eed

up

(a) PCIe Crypto Accelerator

16 128 g1 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

CL

oC AoCA

oA

Granularity (Bytes)

(b) UltraSPARC T2

16 128 g1 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000CL

A

oC AoCA

oA

Granularity (Bytes)

(c) SPARC T3

g112

8 1K 8K 64K

512K 4M 32

M

01

1

10

100

1000

A

oCA A

CL

oA

Granularity (Bytes)

Sp

eed

up

(d) SPARC T4 engine

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

oCA A

CL

oA

Granularity (Bytes)

(e) SPARC T4 instruction

LogCA L110xo110x C10x A10x

Figure 10 LogCA for performing Advanced Encryption Standard on various crypto accelerators LogCA identifies the design bottle-necks through LogCA parameters in an optimization region The bottlenecks which LogCA suggests in each design is optimized inthe next design

16 128 1K 8K g1

64K

512K 4M 32

M

01

1

10

100

1000

A

LoC LCALC A

ALC

o

Granularity (Bytes)

Sp

eed

up

(a) NVIDIA Discrete GPU

16 128 1K 8K g1

64K

512K 4M 32

M

01

1

10

100

1000

A

LoCLo

CA

ACA

AoL

C

Granularity (Bytes)

(b) AMD Integrated GPU (APU)

16 128 g11K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

oC o

CA

CA

o

AC

Granularity (Bytes)

(c) HSA supported AMD Integrated GPU

LogCA L110xo110x C10x A10x

Figure 11 Various Optimization regions for matrix multiplication over a range of granularities on (a) NVIDIA discrete GPU (b)AMD APU and (c) HSA Supported GPU

(sect31) We also observe that the computational index occupies mostof the regions which signifies maximum optimization potential

The discrete GPU has four optimization regions (Figure 11 (a))Among these latency dominates most of the regions signifyinghigh-latency data copying over the PCIe bus and thus maximumoptimization potential The high cut-off granularity for overheads at

32KB indicates high overhead OS calls to access the GPU Similarlywith highly aggressive cores acceleration has high cut-off granular-ity of 256KB indicating less optimization potential for acceleration

Similar to the discrete GPU the APU also has four optimiza-tion regions (Figure 11 (b)) There are few notable differences ascompared to the discrete GPU The cut-off granularity for latency

ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

reduces to 512KB with the elimination of data copying over thePCIe bus the overheads are still high suggesting high overhead OScalls to access the APU with less aggressive cores the cut-off granu-larity for acceleration reduces to 64KB implying more optimizationpotential for acceleration

Figure 11 (c) shows three optimization regions for the HSA en-abled integrated GPU We observe that latency is absent in all regionsand the cut-off granularity for overhead reduces to 8KB These re-ductions in overheads and latencies signify a simpler interface ascompared to the discrete GPU and APU We also observe that thecut-off granularity for acceleration drops to 2KB suggesting higherpotential for optimizing acceleration

6 RELATED WORKWe compare and contrast our work with prior approaches Lopez-Novoa et al [28] provide a detailed survey of various acceleratormodeling techniques We broadly classify these techniques in twocategories and discuss the most relevant work

Analytical Models There is a rich body of work exploring ana-lytical models for performance prediction of accelerators For somemodels the motivation is to determine the future trend in heteroge-neous architectures Chung et al [7] in a detailed study predict thefuture landscape of heterogeneous computing Hempstead et al [17]propose an early-stage model Navigo that determines the fraction ofarea required for accelerators to maintain the traditional performancetrend Nilakantan et al [32] propose to incorporate communicationcost for early-stage model of accelerator-rich architectures For oth-ers the motivation is to determine the right amount of data to offloadand the potential benefits associated with an accelerator [24]

Some analytical models are architecture specific For examplea number of studies [20 21 44 57] predict performance of GPUarchitectures Hong et al [20] present an analytical performancemodel for predicting execution time on GPUs They later extendtheir model and develop an integrated power and performance modelfor the GPUs [21] Song et al [44] use a simple counter basedapproach to predict power and performance Meswani et al [29]explore such models for high performance applications Daga etal [11] analyze the effectiveness of Accelerated processing units(APU) over GPUs and describe the communication cost over thePCIe bus as a major bottleneck in exploiting the full potential ofGPUs

In general our work is different from these studies because ofthe complexity These models use a large number of parameters toaccurately predict the power andor performance whereas we limitthe number of parameters to reduce the complexity of our modelThey also require deep understanding of the underlying architectureMost of these models also require access to GPU specific assemblyor PTX codes Unlike these approaches we use CPU code to providebounds on the performance

Roofline Models In terms of simplicity and motivation our workclosely matches the Roofline model [54]mdasha visual performancemodel for multi-core architectures Roofline exposes bottlenecks fora kernel and suggests several optimizations which programmers canuse to fine tune the kernel on a given system

A number of extensions of Roofline have been proposed [1022 33 56] and some of these extensions are architecture specific

For example targeting GPUs [22] vector processors [39] and FP-GAs [10 56]

Despite the similarities roofline and its extensions cannot be usedfor exposing design bottlenecks in an acceleratorrsquos interface Theprimary goal of roofline models has been to help programmers andcompiler writer while LogCA provides more insights for architects

7 CONCLUSION AND FUTURE WORKWith the recent trend towards heterogeneous computing we feelthat the architecture community lacks a model to reason about theneed of accelerators In this respect we propose LogCAmdashan insight-ful visual performance model for hardware accelerators LogCAprovides insights early in the design stage to both architects andprogrammers and identifies performance bounds exposes interfacedesign bottlenecks and suggest optimizations to alleviate these bot-tlenecks We have validated our model across a range of on-chip andoff-chip accelerators and have shown its utility using retrospectivestudies describing the evolution of acceleratorrsquos interface in thesearchitectures

The applicability of LogCA can be limited by our simplifying assumptions, and for more realistic analysis we plan to overcome these limitations in future work. For example, we assume a single-accelerator system and do not explicitly model contention among resources; our model should handle multi-accelerator and pipelined scenarios. For fixed-function accelerators, our design space is currently limited to encryption and hashing kernels; to overcome this, we are expanding our design space with the compression and database accelerators in the Oracle M7 processor. We also plan to complement LogCA with an energy model, as energy efficiency is a prime design metric for accelerators.

ACKNOWLEDGEMENTS
We thank our anonymous reviewers, Arkaprava Basu, and Tony Nowatzki for their insightful comments and feedback on the paper. Thanks to Mark Hill, Michael Swift, Wisconsin Computer Architecture Affiliates, and other members of the Multifacet group for their valuable discussions. We also thank Brian Wilson at University of Wisconsin DoIT and Eric Sedlar at Oracle Labs for providing access to SPARC T3 and T4 servers, respectively. Also thanks to Muhammad Umair Bin Altaf for his help in the formulation. This work is supported in part by the National Science Foundation (CNS-1302260, CCF-1438992, CCF-1533885, CCF-1617824), Google, and the University of Wisconsin-Madison (Amar and Balindar Sohi Professorship in Computer Science). Wood has a significant financial interest in AMD and Google.

REFERENCES
[1] Advanced Micro Devices. 2016. APP SDK - A Complete Development Platform. Advanced Micro Devices. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk
[2] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS '67 (Spring)). 483. https://doi.org/10.1145/1465482.1465560
[3] Dan Anderson. 2012. How to tell if SPARC T4 crypto is being used. https://blogs.oracle.com/DanX/entry/how_to_tell_if_sparc
[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. 2009. A View of the Parallel Computing Landscape. Commun. ACM 52, 10 (Oct. 2009), 56-67. https://doi.org/10.1145/1562764.1562783
[5] Nathan Beckmann and Daniel Sanchez. 2016. Cache Calculus: Modeling Caches through Differential Equations. Computer Architecture Letters PP, 99 (2016), 1. https://doi.org/10.1109/LCA.2015.2512873
[6] C. Cascaval, S. Chatterjee, H. Franke, K. J. Gildea, and P. Pattnaik. 2010. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development 54 (2010), 5:1-5:10. https://doi.org/10.1147/JRD.2010.2059721
[7] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. 2010. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (Dec. 2010). 225-236. https://doi.org/10.1109/MICRO.2010.36
[8] Jason Cong, Zhenman Fang, Michael Gill, and Glenn Reinman. 2015. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. In 2015 IEEE/ACM International Conference on Computer-Aided Design. Austin, TX.
[9] D. Culler, R. Karp, D. Patterson, and A. Sahay. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 1-12. http://dl.acm.org/citation.cfm?id=155333
[10] Bruno da Silva, An Braeken, Erik H. D'Hollander, and Abdellah Touhafi. 2013. Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools. Int. J. Reconfig. Comput. 2013 (Jan. 2013). https://doi.org/10.1155/2013/428078
[11] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. 2011. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing. IEEE, 141-149. https://doi.org/10.1109/SAAHPC.2011.29
[12] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceeding of the 38th Annual International Symposium on Computer Architecture (ISCA '11). ACM Press, New York, NY, USA, 365. https://doi.org/10.1145/2000064.2000108
[13] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. 2010. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development 54 (2010), 3:1-3:11. https://doi.org/10.1147/JRD.2009.2036980
[14] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings - International Symposium on High-Performance Computer Architecture. 503-514. https://doi.org/10.1109/HPCA.2011.5749755
[15] Shay Gueron. 2012. Intel Advanced Encryption Standard (AES) Instructions Set. Technical Report. Intel Corporation. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[16] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). 37. https://doi.org/10.1145/1815961.1815968
[17] Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. Navigo: An early-stage model to study power-constrained architectures and specialization. In ISCA Workshop on Modeling, Benchmarking, and Simulations (MoBS).
[18] John L. Hennessy and David A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach. 704 pages.
[19] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (Jul. 2008), 33-38. https://doi.org/10.1109/MC.2008.209
[20] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, Vol. 37. 152-163. https://doi.org/10.1145/1555815.1555775
[21] Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, Vol. 38. 280-289. https://doi.org/10.1145/1816038.1815998
[22] Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par '12). Springer-Verlag, Berlin, Heidelberg, 920-932. https://doi.org/10.1007/978-3-642-32820-6_90
[23] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 468-479. https://doi.org/10.1145/2540708.2540748
[24] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. 2013. A survey of computation offloading for mobile systems. Mobile Networks and Applications 18 (2013), 129-140. https://doi.org/10.1007/s11036-012-0368-0
[25] Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 361-372. https://doi.org/10.1145/2751205.2751231
[26] Maysam Lavasani, Hari Angepat, and Derek Chiou. 2014. An FPGA-based In-Line Accelerator for Memcached. IEEE Comput. Archit. Lett. 13, 2 (Jul. 2014), 57-60. https://doi.org/10.1109/L-CA.2013.17
[27] John D. C. Little and Stephen C. Graves. 2008. Little's law. In Building Intuition. Springer, 81-100.
[28] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso. 2015. A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. IEEE Transactions on Parallel and Distributed Systems 26, 1 (Jan. 2015), 272-281. https://doi.org/10.1109/TPDS.2014.2308216
[29] M. R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole. 2013. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications 27 (2013), 89-108. https://doi.org/10.1177/1094342012468180
[30] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). https://doi.org/10.6028/NIST.FIPS.197
[31] National Institute of Standards and Technology. 2008. Secure Hash Standard. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
[32] S. Nilakantan, S. Battle, and M. Hempstead. 2013. Metrics for Early-Stage Modeling of Many-Accelerator Architectures. Computer Architecture Letters 12, 1 (Jan. 2013), 25-28. https://doi.org/10.1109/L-CA.2012.9
[33] Cedric Nugteren and Henk Corporaal. 2012. The Boat Hull Model: Enabling Performance Prediction for Parallel Computing Prior to Code Development. In Proceedings of the 9th Conference on Computing Frontiers. ACM, 203-212.
[34] OpenSSL Software Foundation. 2015. OpenSSL: Cryptography and SSL/TLS Toolkit. https://openssl.org
[35] Sanjay Patel. 2009. Sun's Next-Generation Multithreaded Processor: Rainbow Falls. In 21st Hot Chips Symposium.
[36] Sanjay Patel and Wen-mei W. Hwu. 2008. Accelerator Architectures. IEEE Micro 28, 4 (Jul. 2008), 4-12. https://doi.org/10.1109/MM.2008.50
[37] Stephen Phillips. 2014. M7: Next Generation SPARC. In 26th Hot Chips Symposium.
[38] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.
[39] Yoshiei Sato, Ryuichi Nagaoka, Akihiro Musa, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Hiroaki Kobayashi. 2009. Performance tuning and analysis of future vector processors based on the roofline model. In Proceedings of the 10th MEDEA Workshop on MEmory performance: DEaling with Applications, systems and architecture (MEDEA '09). 7. https://doi.org/10.1145/1621960.1621962
[40] M. Shah, J. Barren, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. 2007. UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC. In Solid-State Circuits Conference, 2007 (ASSCC '07), IEEE Asian. 22-25. https://doi.org/10.1109/ASSCC.2007.4425786
[41] Manish Shah, Robert Golla, Gregory Grohoski, Paul Jordan, Jama Barreh, Jeffrey Brooks, Mark Greenberg, Gideon Levinsky, Mark Luttrell, Christopher Olson, Zeid Samoail, Matt Smittle, and Thomas Ziaja. 2012. Sparc T4: A dynamically threaded server-on-a-chip. IEEE Micro 32 (2012), 8-19. https://doi.org/10.1109/MM.2012.1
[42] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In International Symposium on Computer Architecture (ISCA).
[43] Soekris Engineering. 2016. vpn1401 for Std. PCI-sockets. Soekris Engineering. http://soekris.com/products/vpn-1401.html
[44] Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS 2013). 673-686. https://doi.org/10.1109/IPDPS.2013.73
[45] Jeff Stuecheli. 2013. POWER8. In 25th Hot Chips Symposium.
[46] Ning Sun and Chi-Chang Lin. 2007. Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors. Technical Report. http://www.oracle.com/technetwork/server-storage/solaris/documentation/819-5782-150147.pdf
[47] S. Tabik, G. Ortega, and E. M. Garzón. 2014. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. The Journal of Supercomputing 70, 2 (Nov. 2014), 577-587. https://doi.org/10.1007/s11227-014-1102-4
[48] Y. C. Tay. 2013. Analytical Performance Modeling for Computer Systems (2nd ed.). Morgan & Claypool Publishers.
[49] M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE. 1131-1136. https://doi.org/10.1145/2228360.2228567
[50] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 205-218. https://doi.org/10.1145/1736020.1736044
[51] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44). 163. https://doi.org/10.1145/2155620.2155640
[52] Guibin Wang, Yisong Lin, and Wei Yi. 2010. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on Cyber, Physical and Social Computing (CPSCom). 344-350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102
[53] Eric W. Weisstein. 2015. Newton's Method. From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html
[54] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52 (2009), 65-76. https://doi.org/10.1145/1498765.1498785
[55] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 255-268. https://doi.org/10.1145/2541940.2541961
[56] Moein Pahlavan Yali. 2014. FPGA-Roofline: An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems. PhD Dissertation. Virginia Polytechnic Institute and State University.
[57] Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings - International Symposium on High-Performance Computer Architecture. 382-393. https://doi.org/10.1109/HPCA.2011.5749745


accelerator and its interface with the system. In some cases, these factors may outweigh the potential benefits, resulting in lower than expected or, in the worst case, no performance gains. Figure 1 illustrates such an outcome for the crypto accelerator in UltraSPARC T2 running the Advanced Encryption Standard (AES) kernel [30].

Figure 1 provides two key observations. First, accelerators can under-perform compared to a general-purpose core, e.g., the accelerated version on UltraSPARC T2 outperforms the unaccelerated one only after crossing a threshold block size, i.e., the break-even point (Figure 1-a). Second, different accelerators, while executing the same kernel, have different break-even points, e.g., SPARC T4 breaks even for smaller offloaded data, while UltraSPARC T2 and the GPU break even for large offloaded data (Figure 1-b).

Understanding the factors which dictate the performance of an accelerator is crucial for both architects and programmers. Programmers need to be able to predict when offloading a kernel will be performance efficient. Similarly, architects need to understand how the accelerator's interface, and the resulting latency and overheads to offload a kernel, will affect the achievable accelerator performance. Considering the AES encryption example, programmers and architects would greatly benefit from understanding: What bottlenecks cause UltraSPARC T2 and the GPU to under-perform for small data sizes? Which optimizations on UltraSPARC T2 and the GPU result in performance similar to SPARC T4? Which optimizations are programmer dependent and which are architect dependent? And what are the trade-offs in selecting one optimization over the other?

To answer these questions, programmers and architects can employ either complex or simple modeling techniques. Complex modeling techniques and full-system simulation [8, 42] can provide highly accurate performance estimates. Unfortunately, they often require low-level system details which are not available until late in the design cycle. In contrast, analytical models, simpler ones in particular, abstract away these low-level system details and provide key insights early in the design cycle that are useful for experts and non-experts alike [2, 5, 19, 27, 48, 54].

This paper presents LogCA, an insightful model for hardware accelerators. LogCA derives its name from its five key parameters (Table 1). These parameters characterize the communication latency (L) and overheads (o) of the accelerator interface, the granularity or size (g) of the offloaded data, the complexity (C) of the computation, and the accelerator's performance improvement (A) as compared to a general-purpose core.

LogCA is inspired by LogP [9], the well-known parallel computation model. LogP sought to find the right balance between overly simple models (e.g., PRAM) and the detailed reality of modern parallel systems. LogCA seeks to strike the same balance for hardware accelerators, providing sufficient simplicity such that programmers and architects can easily reason with it. Just as LogP was not the first model of parallel computation, LogCA is not the first model for hardware accelerators [28]. With LogCA, our goal is to develop a simple model that supports the important implications (§2) of our analysis, using as few parameters as possible while providing sufficient accuracy. In Einstein's words, we want our model to be as simple as possible, and no simpler.

LogCA helps programmers and architects reason about an accelerator by abstracting the underlying architecture. It provides insights about the accelerator's interface by exposing the design bounds and bottlenecks, and suggests optimizations to alleviate these bottlenecks. The visually identifiable optimization regions help both experts and non-experts quantify the trade-offs in favoring one optimization over the other. While the general trend may not be surprising, we argue that LogCA is accurate enough to answer important what-if questions very early in the design cycle.

We validate our model across on-chip and off-chip accelerators for a diverse set of kernels, ranging from sub-linear to super-linear complexities. We also demonstrate the utility of our model using two retrospective case studies (§5). In the first case study, we consider the evolution of the interface in the cryptographic accelerators on Sun/Oracle's SPARC T-series processors. For the second case, we consider the memory interface design in three different GPU architectures: a discrete GPU, an integrated GPU, and a heterogeneous system architecture (HSA) [38] supported GPU. In both case studies, we show that the adopted design optimizations for these machines are similar to LogCA's suggested optimizations. We argue that architects and programmers can use insights from these retrospective studies for improving future designs.

This paper makes the following contributions:

• We propose a high-level visual performance model providing insights about the interface of hardware accelerators (§2).
• We formalize performance metrics for predicting the "right" amount of offloaded data (§2.2).
• Our model identifies the performance bounds and bottlenecks associated with an accelerator design (§3).
• We provide answers to what-if questions for both programmers and architects at an early design stage (§3).
• We define various optimization regions and the potential gains associated with these regions (§3.2).
• We demonstrate the utility of our model on five different cryptographic accelerators and three different GPU architectures (§5).

2 THE LogCA MODEL
LogCA assumes an abstract system with three components (Figure 2 (a)): the Host is a general-purpose processor; the Accelerator is a hardware device designed for the efficient implementation of an algorithm; and the Interface connects the host and accelerator, abstracting away system details including the memory hierarchy.

Our model uses the interface abstraction to provide intuition for the overhead and latency of dispatching work to an accelerator. This abstraction enables modeling of different paradigms for attaching accelerators: directly connected, via the system bus, or via PCIe. This also gives the flexibility to use our model for both on-chip and off-chip accelerators. This abstraction can also be trivially mapped to shared memory systems or other memory hierarchies in heterogeneous architectures. The model further abstracts the underlying architecture using the five parameters defined in Table 1.

Figure 2 (b) illustrates the overhead and latency model for an un-pipelined accelerator, where computation 'i' is returned before requesting computation 'i+1'. Figure 2 (b) also shows the breakdown of time for an algorithm on the host and accelerator.


Table 1: Description of the LogCA parameters.

  Latency (L): Cycles to move data from the host to the accelerator across the interface, including the cycles data spends in the caches or memory. Units: cycles.
  Overhead (o): Cycles the host spends in setting up the algorithm. Units: cycles.
  Granularity (g): Size of the offloaded data. Units: bytes.
  Computational Index (C): Cycles the host spends per byte of data. Units: cycles/byte.
  Acceleration (A): The peak speedup of an accelerator. Units: N/A (dimensionless).

Figure 2: Top-level description of the LogCA model. (a) shows the various components (host, interface, accelerator); (b) shows the time-line for the computation performed on the host system (above) and on an accelerator (below).

We assume that the algorithm's execution time is a function of granularity, i.e., the size of the offloaded data. With this assumption, the un-accelerated time $T_0$ (time with zero accelerators) to process data of granularity $g$ will be $T_0(g) = C_0(g)$, where $C_0(g)$ is the computation time on the host.

When the data is offloaded to an accelerator, the new execution time $T_1$ (time with one accelerator) is $T_1(g) = O_1(g) + L_1(g) + C_1(g)$, where $O_1(g)$ is the host overhead time in offloading $g$ bytes of data to the accelerator, $L_1(g)$ is the interface latency, and $C_1(g)$ is the computation time on the accelerator to process data of granularity $g$.

To make our model more concrete, we make several assumptions. We assume that an accelerator with acceleration $A$ can decrease, in the absence of overheads, the algorithm's computation time on the host by a factor of $A$, i.e., the accelerator and host use algorithms with the same complexity. Thus, the computation time on the accelerator will be $C_1(g) = \frac{C_0(g)}{A}$. This reduction in the computation time results in performance gains, and we quantify these gains with speedup, the ratio of the un-accelerated and accelerated time:

\[ \mathrm{Speedup}(g) = \frac{T_0(g)}{T_1(g)} = \frac{C_0(g)}{O_1(g) + L_1(g) + C_1(g)} \tag{1} \]

We assume that the computation time is a function of the computational index $C$ and granularity, i.e., $C_0(g) = C \cdot f(g)$, where $f(g)$ signifies the complexity of the algorithm. We also assume that $f(g)$ is a power function of $g$, i.e., $O(g^{\beta})$. This assumption results in a simple closed-form model and bounds the performance for a majority of the prevalent algorithms in the high-performance computing community [4], ranging from sub-linear ($\beta < 1$) to super-linear ($\beta > 1$) complexities. However, this assumption may not work well for logarithmic-complexity algorithms, i.e., $O(\log(g))$ and $O(g \log(g))$, because asymptotically there is no power function which grows slower than a logarithmic function. Despite this limitation, we observe that, in the granularity range of our interest, LogCA can also bound the performance for logarithmic functions (§5).

For many algorithms and accelerators, the overhead is independent of the granularity, i.e., $O_1(g) = o$. Latency, on the other hand, will often be granularity dependent, i.e., $L_1(g) = L \cdot g$. Latency may be granularity independent if the accelerator can begin operating when the first byte (or block) arrives at the accelerator, i.e., $L_1(g) = L$. Thus, LogCA can also model pipelined interfaces using the granularity-independent latency assumption.

We define computational intensity¹ as the ratio of computational index to latency, i.e., $\frac{C}{L}$; it signifies the amount of work done on the host per byte of offloaded data. Similarly, we define the accelerator's computational intensity as the ratio of computational intensity to acceleration, i.e., $\frac{C/A}{L}$; it signifies the amount of work done on the accelerator per byte of offloaded data.

For simplicity, we begin with the assumption of granularity-independent latency, and revisit granularity-dependent latencies later (§2.3). With these assumptions,

\[ \mathrm{Speedup}(g) = \frac{C \cdot f(g)}{o + L + \frac{C \cdot f(g)}{A}} = \frac{C \cdot g^{\beta}}{o + L + \frac{C \cdot g^{\beta}}{A}} \tag{2} \]

The above equation shows that the speedup depends on the LogCA parameters, and these parameters can be changed by architects and programmers through algorithmic and design choices. An architect can reduce the latency by integrating an accelerator more closely with the host, for example, placing it on the processor die rather than on an I/O bus. An architect can also reduce the overheads by designing a simpler interface, i.e., limited OS intervention and address translation, lower initialization time, and reduced data copying between buffers (memories). A programmer can increase the computational index by increasing the amount of work per byte offloaded to an accelerator. For example, kernel fusion [47, 52], where multiple computational kernels are fused into one, tends to increase the computational index. Finally, an architect can typically increase the acceleration by investing more chip resources or power in an accelerator.
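As a concrete illustration, the minimal sketch below evaluates the closed-form speedup of equation (2). The helper itself is our illustration, not part of the paper's artifact; the parameter values are the UltraSPARC T2 AES numbers from Table 5.

```python
# A minimal sketch of equation (2): speedup under granularity-
# independent latency. Parameter values are the UltraSPARC T2
# AES numbers from Table 5 (L and o in cycles, C in cycles/byte).

def speedup(g, C, beta, o, L, A):
    """Speedup(g) = C*g^beta / (o + L + C*g^beta / A)."""
    host_time = C * g ** beta          # C0(g), host computation time
    return host_time / (o + L + host_time / A)

if __name__ == "__main__":
    for g in [16, 256, 4 * 1024, 64 * 1024, 1024 * 1024]:
        s = speedup(g, C=90, beta=1.0, o=29e3, L=1.5e3, A=19)
        print(f"g = {g:>8} B: speedup = {s:.2f}")
```

Running this reproduces the qualitative shape of Figure 3: speedup below 1 for small granularities and an asymptotic approach to A = 19 for large ones.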

2.1 Effect of Granularity
A key aspect of LogCA is that it captures the effect of granularity on the accelerator's speedup. Figure 3 shows this behavior, i.e., speedup increases with granularity and is bounded by the acceleration $A$.

¹ Not to be confused with operational intensity [54], which signifies operations performed per byte of DRAM traffic.


Figure 3: A graphical description of the performance metrics. The plot shows Speedup(g) versus granularity (bytes), marking $g_1$ and $g_{A/2}$ on the x-axis and speedups of 1, A/2, and A on the y-axis.

At one extreme, for large granularities, equation (2) becomes

\[ \lim_{g \to \infty} \mathrm{Speedup}(g) = A \tag{3} \]

while for small granularities, equation (2) reduces to

\[ \lim_{g \to 0} \mathrm{Speedup}(g) \simeq \frac{C}{o + L + \frac{C}{A}} < \frac{C}{o + L} \tag{4} \]

Equation (4) is simply Amdahl's Law [2] for accelerators, demonstrating the dominating effect of overheads at small granularities.

2.2 Performance Metrics
To help programmers decide when and how much computation to offload, we define two performance metrics. These metrics are inspired by the vector machine metrics $N_v$ and $N_{1/2}$ [18], where $N_v$ is the vector length needed to make vector mode faster than scalar mode, and $N_{1/2}$ is the vector length needed to achieve half of the peak performance. Since vector length is an important parameter in determining performance gains for vector machines, these metrics characterize the behavior and efficiency of vector machines with reference to scalar machines. Our metrics serve the same purpose in the accelerator domain.

$g_1$: The granularity to achieve a speedup of 1 (Figure 3). It is the break-even point where the accelerator's performance becomes equal to the host's. Thus, it is the minimum granularity at which an accelerator starts providing benefits. Solving equation (2) for $g_1$ gives

\[ g_1 = \left[ \left( \frac{A}{A-1} \right) \cdot \left( \frac{o+L}{C} \right) \right]^{\frac{1}{\beta}} \tag{5} \]

IMPLICATION 1. $g_1$ is essentially independent of acceleration for large values of $A$.

For reducing $g_1$, the above implication guides an architect to invest resources in improving the interface.

IMPLICATION 2. Doubling the computational index reduces $g_1$ by $2^{-\frac{1}{\beta}}$.

The above implication demonstrates the effect of algorithmic complexity on $g_1$ and shows that varying the computational index has a profound effect on $g_1$ for sub-linear algorithms. For example, for a sub-linear algorithm with $\beta = 0.5$, doubling the computational index decreases $g_1$ by a factor of four, whereas for linear ($\beta = 1$) and quadratic ($\beta = 2$) algorithms, $g_1$ decreases by factors of two and $\sqrt{2}$, respectively.

$g_{A/2}$: The granularity to achieve a speedup of half of the acceleration. This metric provides information about a system's behavior after the break-even point and shows how quickly the speedup can ramp towards acceleration. Solving equation (2) for $g_{A/2}$ gives

\[ g_{\frac{A}{2}} = \left[ A \cdot \left( \frac{o+L}{C} \right) \right]^{\frac{1}{\beta}} \tag{6} \]

Using equations (5) and (6), $g_1$ and $g_{A/2}$ are related as

\[ g_{\frac{A}{2}} = (A-1)^{\frac{1}{\beta}} \cdot g_1 \tag{7} \]

IMPLICATION 3. Doubling the acceleration $A$ increases the granularity to attain $\frac{A}{2}$ by $2^{\frac{1}{\beta}}$.

The above implication demonstrates the effect of acceleration on $g_{A/2}$ and shows that this effect is more pronounced for sub-linear algorithms. For example, for a sub-linear algorithm with $\beta = 0.5$, doubling the acceleration increases $g_{A/2}$ by a factor of four, whereas for linear and quadratic algorithms, $g_{A/2}$ increases by factors of two and $\sqrt{2}$, respectively.

For architects, equation (7) also exposes an interesting design trade-off between acceleration and the performance metrics. Typically, an architect may prefer higher acceleration and lower $g_1$ and $g_{A/2}$. However, equation (7) shows that increasing acceleration also increases $g_{A/2}$. This presents a dilemma for an architect: favor either higher acceleration or reduced granularity, especially for sub-linear algorithms. LogCA helps by exposing these trade-offs at an early design stage.

In our model, we also use $g_1$ to determine the complexity of the system's interface. A lower $g_1$ (on the left side of the plot in Figure 3) is desirable, as it implies a system with lower overheads and thus a simpler interface. Likewise, $g_1$ increases with the complexity of the interface, or when an accelerator moves further away from the host.
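Both metrics follow mechanically from equations (5)-(7). The short sketch below computes them; it is our illustration, again using the UltraSPARC T2 AES parameters from Table 5 purely as example inputs.

```python
# Sketch of the performance metrics under granularity-independent
# latency: g1 from equation (5) and g_{A/2} from equation (6).

def g1(C, beta, o, L, A):
    # Break-even granularity: speedup(g1) = 1.
    return ((A / (A - 1)) * ((o + L) / C)) ** (1.0 / beta)

def g_half(C, beta, o, L, A):
    # Granularity where speedup reaches A/2.
    return (A * ((o + L) / C)) ** (1.0 / beta)

p = dict(C=90, beta=1.0, o=29e3, L=1.5e3, A=19)
print(f"g1   = {g1(**p):.0f} B")    # break-even point
print(f"gA/2 = {g_half(**p):.0f} B")  # equation (7): (A-1)^(1/beta) * g1
```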

2.3 Granularity Dependent Latency
The previous section assumed that latency is granularity independent, but we have observed granularity-dependent latencies in GPUs. In this section, we discuss the effect of granularity on speedup and derive the performance metrics assuming granularity-dependent latency.

Assuming granularity-dependent latency, equation (1) reduces to

\[ \mathrm{Speedup}(g) = \frac{C \cdot g^{\beta}}{o + L \cdot g + \frac{C \cdot g^{\beta}}{A}} \tag{8} \]

For large granularities, equation (8) reduces to

\[ \lim_{g \to \infty} \mathrm{Speedup}(g) = \frac{A}{\frac{A}{C \cdot g^{\beta}} \cdot (L \cdot g) + 1} < \frac{C}{L} \cdot g^{\beta - 1} \tag{9} \]

Unlike equation (3), the speedup in the above equation approaches $\frac{C}{L} \cdot g^{\beta-1}$ at large granularities. Thus, for linear algorithms with granularity-dependent latency, speedup is limited by $\frac{C}{L}$ instead of acceleration. However, for super-linear algorithms this limit increases by a factor of $g^{\beta-1}$, whereas for sub-linear algorithms it decreases by a factor of $g^{\beta-1}$.


IMPLICATION 4. With granularity-dependent latency, the speedup for sub-linear algorithms asymptotically decreases with an increase in granularity.

The above implication suggests that for sub-linear algorithms on systems with granularity-dependent latency, speedup may decrease for some large granularities. This happens because, for large granularities, the communication latency (a linear function of granularity) may be higher than the computation time (a sub-linear function of granularity) on the accelerator, resulting in a net de-acceleration. This implication is surprising, as earlier we observed that, for systems with granularity-independent latency, speedup for all algorithms increases with granularity and approaches acceleration for very large granularities.

For very small granularities, equation (8) reduces to

\[ \lim_{g \to 0} \mathrm{Speedup}(g) \simeq \frac{A \cdot C}{A \cdot (o + L) + C} \tag{10} \]

Similar to equation (4), the above equation exposes the increasing effect of overheads at small granularities. Solving equation (8) for $g_1$ using Newton's method [53] gives

\[ g_1 = \frac{C \cdot (\beta - 1) \cdot (A - 1) + A \cdot o}{C \cdot \beta \cdot (A - 1) - A \cdot L} \tag{11} \]

For a positive value of $g_1$, equation (11) must satisfy $\frac{C}{L} > \frac{1}{\beta}$. Thus, for achieving any speedup for linear algorithms, $\frac{C}{L}$ should be at least 1. However, for super-linear algorithms, a speedup of 1 can be achieved at values of $\frac{C}{L}$ smaller than 1, whereas for sub-linear algorithms $\frac{C}{L}$ must be greater than 1.

IMPLICATION 5. With granularity-dependent latency, the computational intensity for sub-linear algorithms should be greater than 1 to achieve any gains.

Thus, for sub-linear algorithms, the computational index has to be greater than the latency to justify offloading the work. However, for higher-complexity algorithms, the computational index can be quite small and offloading can still potentially be useful.

Similarly, solving equation (8) using Newton's method for $g_{A/2}$ gives

\[ g_{\frac{A}{2}} = \frac{C \cdot (\beta - 1) + A \cdot o}{C \cdot \beta - A \cdot L} \tag{12} \]

For a positive value of $g_{A/2}$, equation (12) must satisfy $\frac{C/A}{L} > \frac{1}{\beta}$. Thus, for achieving a speedup of $\frac{A}{2}$, $\frac{C}{L}$ should be at least $A$ for linear algorithms. However, for super-linear algorithms, a speedup of $\frac{A}{2}$ can be achieved at values of $\frac{C}{L}$ smaller than $A$, whereas for sub-linear algorithms $\frac{C}{L}$ must be greater than $A$.

IMPLICATION 6. With granularity-dependent latency, the accelerator's computational intensity for sub-linear algorithms should be greater than 1 to achieve a speedup of half of the acceleration.

The above implication suggests that for achieving half of the acceleration with sub-linear algorithms, the computation time on the accelerator must be greater than the latency. However, for super-linear algorithms, that speedup can be achieved even if the computation time on the accelerator is lower than the latency. Programmers can use the above implications to determine, early in the design cycle, whether to put time and effort into porting a code to an accelerator.
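The granularity-dependent counterparts of the metrics, equations (11) and (12), are equally easy to evaluate. The sketch below is our illustration; it also guards the existence conditions stated above (C/L > 1/β for $g_1$, and (C/A)/L > 1/β for $g_{A/2}$).

```python
# Sketch of g1 (eq. 11) and g_{A/2} (eq. 12) under granularity-
# dependent latency, guarding the existence conditions from the text.

def g1_dep(C, beta, o, L, A):
    if C / L <= 1.0 / beta:
        return None  # no break-even granularity exists
    return (C * (beta - 1) * (A - 1) + A * o) / (C * beta * (A - 1) - A * L)

def g_half_dep(C, beta, o, L, A):
    if (C / A) / L <= 1.0 / beta:
        return None  # a speedup of A/2 is unreachable
    return (C * (beta - 1) + A * o) / (C * beta - A * L)
```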

Figure 4: LogCA helps in visually identifying (a) compute-bound and (b) latency-bound kernels. (a) Performance bounds for compute-bound kernels; (b) performance bounds for latency-bound kernels. Each panel plots speedup versus granularity (bytes), annotating $g_1$, $C/L$, A, and the large-granularity asymptote of the speedup.

For example, consider a system with a minimum desirable speedup of one half of the acceleration, but with a computational intensity less than the acceleration. With the above implication, architects and programmers can infer, early in the design stage, that the desired speedup cannot be achieved for sub-linear and linear algorithms. However, the desired speedup can be achieved with super-linear algorithms.

We are also interested in quantifying the limits on achievable speedup due to overheads and latencies. To do this, we assume a hypothetical accelerator with infinite acceleration and calculate the granularity ($g_A$) to achieve the peak speedup of $A$. With this assumption, the desired speedup of $A$ is only limited by the overheads and latencies. Solving equation (8) for $g_A$ gives

\[ g_A = \frac{C \cdot (\beta - 1) + A \cdot o}{C \cdot \beta - A \cdot L} \tag{13} \]

Surprisingly, we find that the above equation is identical to equation (12), i.e., $g_A$ equals $g_{A/2}$. This observation shows that, with a hypothetical accelerator, the peak speedup can now be achieved at the same granularity as $g_{A/2}$. It also demonstrates that if $g_{A/2}$ is not achievable on a system, i.e., $\frac{C/A}{L} < \frac{1}{\beta}$ as per equation (12), then despite increasing the acceleration, $g_A$ will not be achievable, and the speedup will still be bounded by the computational intensity.

IMPLICATION 7. If a speedup of $\frac{A}{2}$ is not achievable on an accelerator with acceleration $A$, then despite increasing the acceleration to $\tilde{A}$ (where $\tilde{A} > A$), the speedup is bounded by the computational intensity.

The above implication helps architects in allocating more resources for an efficient interface, instead of increasing acceleration.


Figure 5: The effect on speedup of a 10x improvement in each LogCA parameter: (a) latency, (b) overheads, (c) computational index, and (d) acceleration. The base case is the speedup of AES [30] on UltraSPARC T2.

3 APPLICATIONS OF LogCA
In this section, we describe the utility of LogCA for visually identifying the performance bounds, design bottlenecks, and possible optimizations to alleviate these bottlenecks.

3.1 Performance Bounds
Earlier, we observed that the speedup is bounded by either acceleration (equation (3)) or the product of computational intensity and $g^{\beta-1}$ (equation (9)). Using these observations, we classify kernels as either compute-bound or latency-bound. For compute-bound kernels, the achievable speedup is bounded by acceleration, whereas for latency-bound kernels, the speedup is bounded by computational intensity. Based on this classification, a compute-bound kernel is either running on a system with granularity-independent latency, or has super-linear complexity while running on a system with granularity-dependent latency. Figure 4-a illustrates these bounds for compute-bound kernels. On the other hand, a latency-bound kernel is running on a system with granularity-dependent latency and has either linear or sub-linear complexity. Figure 4-b illustrates these bounds for latency-bound kernels.

Programmers and architects can visually identify these bounds and use this information to invest their time and resources in the right direction. For example, for compute-bound kernels, depending on the operating granularity, it may be beneficial to invest more resources in either increasing acceleration or reducing overheads. However, for latency-bound kernels, optimizing acceleration and overheads is not that critical; decreasing latency and increasing the computational index may be more beneficial.
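This classification can be summarized mechanically. The sketch below is a hypothetical helper of ours (not part of the paper's artifact) that returns a kernel's bound from its latency model and complexity, following equations (3) and (9):

```python
# Sketch: classify a kernel as compute- or latency-bound and
# return its speedup bound (Section 3.1, equations (3) and (9)).

def speedup_bound(C, beta, L, A, latency_grows_with_g):
    if not latency_grows_with_g or beta > 1:
        # Compute-bound: speedup is ultimately bounded by A (eq. 3).
        return "compute-bound", lambda g: A
    # Latency-bound: bounded by computational intensity times
    # g^(beta - 1) (eq. 9).
    return "latency-bound", lambda g: (C / L) * g ** (beta - 1)
```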

3.2 Sensitivity Analysis
To identify the design bottlenecks, we perform a sensitivity analysis of the LogCA parameters. We consider a parameter a design bottleneck if a 10x improvement in it provides at least a 20% improvement in speedup. A 'bottlenecked' parameter also provides an optimization opportunity. To visually identify these bottlenecks, we introduce optimization regions. As an example, we identify design bottlenecks in UltraSPARC T2's crypto accelerator by varying its individual parameters² in Figure 5 (a)-(d).

² We elaborate our methodology for measuring the LogCA parameters later (§4).

Figure 5 (a) shows the variation (or the lack thereof) in speedup with a decrease in latency. The resulting gains are negligible and independent of the granularity, as this is a closely coupled accelerator.

Figure 5 (b) shows the resulting speedup after reducing overheads. Since the overheads are a one-time initialization cost and independent of granularity, the per-byte setup cost is high at small granularities. Decreasing these overheads considerably reduces the per-byte setup cost and results in significant gains at these smaller granularities. Conversely, for larger granularities, the per-byte setup cost is already amortized, so reducing overheads does not provide much gain. Thus, overhead is a bottleneck at small granularities and provides an opportunity for optimization.

Figure 5 (c) shows the effect of increasing the computational index. The results are similar to optimizing overheads in Figure 5 (b), i.e., significant gains for small granularities and a gradual decrease in the gains with increasing granularity. With constant overheads, increasing the computational index increases the computation time of the kernel and decreases the per-byte setup cost. For smaller granularities, the reduced per-byte setup cost results in significant gains.

Figure 5 (d) shows the variation in speedup with increasing peak acceleration. The gains are negligible at small granularities and become significant for large granularities. As mentioned earlier, the per-byte setup cost is high at small granularities and reduces for large granularities. Since increasing peak acceleration does not reduce the per-byte setup cost, optimizing peak acceleration provides gains only at large granularities.

We group these individual sensitivity plots in Figure 6 to build the optimization regions. As mentioned earlier, each region indicates the potential of at least 20% gains with a 10x variation of one or more LogCA parameters. For ease of understanding, we color these regions and label them with their respective LogCA parameters. For example, the blue region labelled 'oC' (16B to 2KB) indicates an optimization region where optimizing overheads and computational index is beneficial. Similarly, the red region labelled 'A' (32KB to 32MB) represents an optimization region where optimizing peak acceleration alone is beneficial. The granularity range occupied by a parameter also identifies the scope of optimization for an architect and a programmer. For example, for UltraSPARC T2, overheads occupy most of the lower granularities, suggesting an opportunity for improving the interface. Similarly, the absence of the latency parameter suggests little benefit from optimizing latency.

We also add horizontal arrows to the optimization regions in Figure 6 to demarcate the start and end of the granularity range for each parameter.


Table 2: Description of the cryptographic accelerators.

                  PCI Crypto         UltraSPARC T2      SPARC T3           SPARC T4             Sandy Bridge
Processor         AMD A8-3850        S2                 S2                 S3                   Intel Core i7-2600
Frequency         2.9 GHz            1.16 GHz           1.65 GHz           3 GHz                3.4 GHz
OpenSSL version   0.9.8o             0.9.8o             0.9.8o             1.0.2, 1.0.1k        0.9.8o
Kernel            Ubuntu 3.13.0-55   Oracle Solaris 11  Oracle Solaris 11  Oracle Solaris 11.2  Linux 2.6.32-504

Figure 6: Optimization regions for UltraSPARC T2. The presence of a parameter in an optimization region indicates that it can provide at least 20% gains. The horizontal arrow indicates the cut-off granularity at which a parameter provides 20% gains.

For example, optimizing acceleration starts providing benefits from 2KB, while optimizing overheads or computational index is beneficial up to 32KB. These arrows also indicate the cut-off granularity for each parameter. These cut-off granularities provide insights to architects and programmers about the design bottlenecks. For example, the high cut-off granularity of 32KB suggests high overheads and thus a potential for optimization.
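The sensitivity study itself is straightforward to reproduce numerically. The sketch below is our illustration: it applies the 20%-gain criterion from the text to mark bottlenecked parameters at each granularity, using the UltraSPARC T2 AES parameters from Table 5.

```python
# Sketch of the Section 3.2 sensitivity analysis: a parameter is a
# bottleneck at granularity g if a 10x improvement in it yields at
# least a 20% gain in speedup (equation (2) as the base model).

def speedup(g, C, beta, o, L, A):
    return (C * g ** beta) / (o + L + (C * g ** beta) / A)

def bottlenecks(g, C, beta, o, L, A):
    base = speedup(g, C, beta, o, L, A)
    candidates = {
        "L": speedup(g, C, beta, o, L / 10, A),   # 10x lower latency
        "o": speedup(g, C, beta, o / 10, L, A),   # 10x lower overheads
        "C": speedup(g, C * 10, beta, o, L, A),   # 10x higher comp. index
        "A": speedup(g, C, beta, o, L, A * 10),   # 10x higher acceleration
    }
    return [name for name, s in candidates.items() if s >= 1.2 * base]

for g in [16, 2 * 1024, 64 * 1024, 32 * 1024 * 1024]:
    print(g, bottlenecks(g, C=90, beta=1.0, o=29e3, L=1.5e3, A=19))
```

On these inputs the output mirrors Figure 6: 'o' and 'C' dominate at small granularities and 'A' at large ones.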

4 EXPERIMENTAL METHODOLOGY
This section describes the experimental setup and benchmarks for validating LogCA on real machines. We also discuss our methodology for measuring the LogCA parameters and performance metrics.

Our experimental setup comprises on-chip and off-chip crypto accelerators (Table 2) and three different GPUs (Table 3). The on-chip crypto accelerators include the cryptographic units on Sun/Oracle UltraSPARC T2 [40], SPARC T3 [35], and SPARC T4 [41], and AES-NI (AES New Instructions) [15] on Sandy Bridge, whereas the off-chip accelerator is a Hifn 7955 chip connected through the PCIe bus [43]. The GPUs include a discrete NVIDIA GPU, an integrated AMD GPU (APU), and an HSA-supported integrated GPU.

For the on-chip crypto accelerators, each core in UltraSPARC T2 and SPARC T3 has a physically addressed crypto unit which requires privileged DMA calls. However, the crypto unit on SPARC T4 is integrated within the pipeline and does not require privileged DMA calls. SPARC T4 also provides non-privileged crypto instructions to access the crypto unit. Similar to SPARC T4, Sandy Bridge provides a non-privileged crypto instruction, AES-NI.

Considering the GPUs, the discrete GPU is connected through the PCIe bus, whereas for the APU, the GPU is co-located with the host processor on the same die. For the APU, the system memory is partitioned between host and GPU memory. This eliminates the PCIe bottleneck of data copying, but it still requires copying data between memories. Unlike the discrete GPU and APU, the HSA-supported GPU provides a unified and coherent view of the system memory. With the host and GPU sharing the same virtual address space, explicit copying of data between memories is not required.

Our workloads consist of encryption, hashing, and GPU kernels. For encryption and hashing, we use the Advanced Encryption Standard (AES) [30] and the Secure Hashing Algorithm (SHA) [31], respectively, from OpenSSL [34], an open-source cryptography library. For GPU kernels, we use matrix multiplication, radix sort, FFT, and binary search from the AMD OpenCL SDK [1]. Table 4 lists the complexities of each kernel, both in terms of the number of elements n and the granularity g. We expect these complexities to remain the same in both cases, but we observe that they differ for matrix multiplication. For example, for a square matrix of size n, matrix multiplication has complexity O(n³), whereas the complexity in terms of granularity is O(g^1.7). This happens because, for matrix multiplication, unlike the others, computations are performed on matrices and not vectors; offloading a square matrix of size n corresponds to offloading n² elements, which results in the apparent discrepancy in the complexities. We also observe that, for the granularity range of 16B to 32MB, β = 0.11 provides a close approximation for log(g).

Table 3: Description of the GPUs.

                   Discrete GPU       Integrated APU    AMD HSA
Name               Tesla C2070        Radeon HD 6550    Radeon R7
Architecture       Fermi              Beaver Creek      Kaveri
Cores              16                 5                 8
Compute Units      448                400               512
Clock Freq.        1.5 GHz            600 MHz           720 MHz
Peak FLOPS         1 T                480 G             856 G
Host Processor     Intel Xeon E5520   AMD A8-3850       AMD A10-7850K
Host Freq. (GHz)   2.27               2.9               1.7

For calculating execution times, we use Linux utilities on the crypto accelerators, whereas for the GPUs we use the NVIDIA and AMD OpenCL profilers to compute the setup, kernel, and data transfer times; we report the average of one hundred executions. For verifying the usage of the crypto accelerators, we use the built-in counters in UltraSPARC T2 and T3 [46].


Table 4: Algorithmic complexity of various kernels in terms of the number of elements (n) and the granularity (g). The power of g represents β for each kernel.

Kernel                               Complexity in n   Complexity in g
Advanced Encryption Standard (AES)   O(n)              O(g^1.01)
Secure Hashing Algorithm (SHA)       O(n)              O(g^0.97)
Matrix Multiplication (GEMM)         O(n^3)            O(g^1.7)
Fast Fourier Transform (FFT)         O(n log n)        O(g^1.2)
Radix Sort                           O(kn)             O(g^0.94)
Binary Search                        O(log n)          O(g^0.14)

Table 5: Calculated values of the LogCA parameters.

Device             Benchmark       L (cycles)   o (cycles)   C (cycles/B)   A
Discrete GPU       AES             3×10³        2×10⁸        174            30
                   Radix Sort      3×10³        2×10⁸        290            30
                   GEMM            3×10³        2×10⁸        2              30
                   FFT             3×10³        2×10⁸        290            30
                   Binary Search   3×10³        2×10⁸        116            30
APU                AES             15           4×10⁸        174            7
                   Radix Sort      15           4×10⁸        290            7
                   GEMM            15           4×10⁸        2              7
                   FFT             15           4×10⁸        290            7
                   Binary Search   15           4×10⁸        116            7
UltraSPARC T2      AES             1500         2.9×10⁴      90             19
                   SHA             1500         1.05×10³     72             12
SPARC T3           AES             1500         2.7×10⁴      90             12
                   SHA             1500         1.05×10³     72             10
SPARC T4 engine    AES             500          435          32             12
                   SHA             500          1.6×10³      32             10
SPARC T4 instr.    AES             4            111          32             12
                   SHA             4            1638         32             10
Sandy Bridge       AES             3            10           35             6

SPARC T4, however, no longer supports these counters, so we use Linux utilities to trace the execution of the crypto instructions [3]. We use these execution times to determine the LogCA parameters. We calculate these parameters once, and they can later be used for different kernels on the same system.

For the computational index and β, we profile the CPU code on the host by varying the granularity from 16B to 32MB. At each granularity, we measure the execution time and use regression analysis to determine C and β. For overheads, we use the observation that for very small granularities the execution time for a kernel on an accelerator is dominated by the overheads, i.e., $\lim_{g \to 0} T_1(g) \simeq o$. For acceleration, we use different methods for the on-chip accelerators and the GPUs. For the on-chip accelerators, we calculate acceleration using equation (3) and the observation that the speedup curve flattens out and approaches the acceleration for very large granularities. However, for the GPUs we do not use equation (3), as it requires computing the acceleration for each kernel, and each application has a different access pattern which affects the speedup. Instead, we bound the maximum performance using the peak FLOPS from the device specifications, i.e., we use the ratio of peak FLOPS on the GPU and CPU: $A = \frac{\text{Peak GFLOPS}_{GPU}}{\text{Peak GFLOPS}_{CPU}}$. Similar to acceleration, we use two different techniques for calculating latency. For the on-chip accelerators, we run micro-benchmarks and use the execution times on the host and accelerators. On the other hand, for the GPUs, we compute latency using the peak memory bandwidth of the GPU. Similar to Meswani et al. [29], we use the following equation for measuring the data copying time for the GPUs: $L = \frac{1}{BW_{peak}}$.

Earlier, we developed our model using the assumptions of granularity-independent and granularity-dependent latencies. In our setup, we observe that the on-chip crypto accelerators and the HSA-enabled GPU represent accelerators with granularity-independent latency, while the off-chip crypto accelerator and the discrete GPU/APU represent granularity-dependent accelerators. For each accelerator, we calculate the speedup and performance metrics using the respective equations (§2).
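As an illustration of this methodology, the sketch below fits C and β from host timings by least-squares regression in log-log space. The timing data here is synthetic, and numpy is assumed; real inputs would be the measured host execution times described above.

```python
# Sketch of the Section 4 parameter fitting: T0(g) = C * g^beta,
# so log T0 = log C + beta * log g, a straight line in log-log
# space. The measurements below are synthetic placeholders.
import numpy as np

def fit_C_beta(granularities, host_times):
    logg = np.log(np.asarray(granularities, dtype=float))
    logt = np.log(np.asarray(host_times, dtype=float))
    beta, logC = np.polyfit(logg, logt, 1)  # slope = beta, intercept = log C
    return float(np.exp(logC)), float(beta)

g = np.array([16, 256, 4096, 65536, 1048576])
t = 90.0 * g ** 1.01              # synthetic near-linear kernel
C, beta = fit_C_beta(g, t)
print(f"C = {C:.1f} cycles/B, beta = {beta:.2f}")   # ~90, ~1.01
```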

5 EVALUATION
In this section, we show that LogCA closely captures the behavior of both off-chip and on-chip accelerators. We also list the calculated LogCA parameters in Table 5. To demonstrate the utility of our model, we also present two case studies. In these studies, we consider the evolution of the interface in Sun/Oracle's crypto accelerators and three different GPU architectures. In both cases, we elaborate on the design changes using the insights LogCA provides.

5.1 Linear-Complexity Kernels (β = 1)
Figure 7 shows the curve fitting of LogCA for AES. We consider both off-chip and on-chip accelerators, connected through different interfaces ranging from the PCIe bus to special instructions. We observe that the off-chip accelerators and the APU, unlike the on-chip accelerators, provide reasonable speedup only at very large granularities. We also observe that the achievable speedup is limited by the computational intensity for the off-chip accelerators and by the acceleration for the on-chip accelerators. This observation supports the earlier implications on the limits of speedup for granularity-independent and granularity-dependent latencies in equations (3) and (9), respectively.

Figure 7 also shows that UltraSPARC T2 provides higher speedups than Sandy Bridge, but it breaks even at a larger granularity. Sandy Bridge, on the other hand, breaks even at a very small granularity but provides limited speedup. The discrete GPU, with powerful processing cores, has the highest acceleration among the others. However, its observed speedup is less than the others due to the high overheads and latencies involved in communicating through the PCIe bus.

We have also marked $g_1$ and $g_{A/2}$ for each accelerator in Figure 7, which helps programmers and architects identify the complexity of the interface. For example, $g_1$ for the crypto instructions, i.e., SPARC T4 and Sandy Bridge, lies on the extreme left, while for the off-chip accelerators $g_1$ lies on the far right. It is worth mentioning that we have marked $g_{A/2}$ for the on-chip accelerators but not for the off-chip accelerators: for the off-chip accelerators, the computational intensity is less than the acceleration, and as we have noted in equation (12), $g_{A/2}$ for these designs does not exist. We also observe that $g_1$ for the crypto card connected through the PCIe bus does not exist, showing that this accelerator does not break even, even for large granularities. Figure 7 also shows that $g_1$ for the GPU and APU is comparable.

Figure 7: Speedup curve-fitting plots comparing LogCA with the observed values of AES [30] on (a) PCIe crypto, (b) NVIDIA discrete GPU, (c) AMD integrated GPU (APU), (d) UltraSPARC T2, (e) SPARC T3, (f) SPARC T4 engine, (g) SPARC T4 instruction, and (h) AESNI on Sandy Bridge.

Figure 8: Speedup curve-fitting plots comparing LogCA with the observed values of SHA256 [31] on (a) UltraSPARC T2 engine, (b) SPARC T3 engine, (c) SPARC T4 engine, and (d) SPARC T4 instruction. LogCA starts following the observed values after 64B.

This observation shows that, despite being an integrated GPU that is not connected to the PCIe bus, the APU spends considerable time in copying data from the host to device memory.

Figure 8 shows the curve fitting for SHA on various on-chip crypto accelerators. We observe that $g_1$ and $g_{A/2}$ do exist, as all of these are on-chip accelerators. We also observe that the LogCA curve mostly follows the observed values. However, it deviates from the observed values below 64B. This happens because SHA requires a block size of 64B for hash computation; if the block size is less than 64B, it pads extra bits to make the block size 64B. Since LogCA does not capture this effect, it does not follow the observed speedup for granularities smaller than 64B.

    Figure 9-a shows the speedup curve fitting plots for Radix sortWe observe that LogCA does not follow observed values for smallergranularities on GPU Despite this inaccuracy LogCA accuratelypredicts g1 and g A

    2 We also observe that g A

    2for GPU is higher than

    APU and this observation supports equation (7) that increasingacceleration increases g A

    2

    52 Super-Linear Complexity Kernels (β gt 1)Figures 9-b and 9-c show the speedup curve fitting plots for super-complexity kernels on discrete GPU and APU We observe that ma-trix multiplication with higher complexity (O (g17)) achieves higherspeedup than sort and FFT with lower complexities of O (g) andO (g12) respectively This observation corroborates results fromequation (9) that achievable speedup of higher-complexity algo-rithms is higher than lower-complexity algorithms We also observethat g A

    2does not exist for FFT This happens because as we note in

    equation (12) that for g A2to exist for FFT C

    L should be greater thanA

    12 However Figure 9-c shows that CL is smaller than A

    12 for bothGPU and APU

    53 Sub-Linear Complexity Kernels (β lt 1)Figure 9-d shows the curve fitting for binary search which is asub-linear algorithm (β = 014) We make three observations Firstg1 does not exist even for very large granularities and C

    L lt 1 Thisobservation supports implication (5) that for a sub-linear algorithm

    ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

    16 128 1K 8K 64

    K51

    2K 4M 32M

    01

    1

    10

    100

    A

    g1 gA2

    CL

    Sp

    eed

    up

    GPU

    16 128 1K 8K 64

    K51

    2K 4M 32M

    01

    1

    10

    100

    A

    g1 gA2

    CL

    APU

    16 128 1K 8K 64

    K51

    2K 4M 32M

    01

    1

    10

    100

    A

    g1 gA2

    GPU

    16 128 1K 8K 64

    K51

    2K 4M 32M

    01

    1

    10

    100

    A

    g1 gA2

    CL

    APU

    16 128 1K 8K 64

    K51

    2K 4M 32M

    01

    1

    10

    100A

    g1

    CL

    Granularity (Bytes)

    Sp

    eed

    up

    GPU

    16 128 1K 8K 64

    K51

    2K 4M 32M

    01

    1

    10

    100

    A

    g1

    CL

    Granularity (Bytes)

    APU

    16 128 1K 8K 64

    K51

    2K 4M 32M

    001

    01

    1

    10

    100A

    g1 gA2

    CL

    Granularity (Bytes)

    GPU

    16 128 1K 8K 64

    K51

    2K 4M 32M

    001

    01

    1

    10

    100

    A

    g1 gA2

    CL

    Granularity (Bytes)

    APU

    (a) Radix Sort (b) Matrix Multiplication

    (c) FFT (d) Binary Search

    observed LogCA

    Figure 9 Speedup curve fittings plots comparing LogCA with the observed values of (a) Radix Sort (b) Matrix Multiplication (c) FFTand (d) Binary Search

    of β = 014 CL should be greater than 7 to provide any speedup

    Second for large granularities speedup starts decreasing with anincrease in granularity This observation supports our earlier claimin implication (4) that for systems with granularity dependent la-tencies speedup for sub-linear algorithms asymptotically decreasesThird LogCA deviates from the observed value at large granularitiesThis deviation occurs because LogCA does not model caches Asmentioned earlier LogCA abstracts the caches and memories witha single parameter of latency which does not capture the memory-access pattern accurately Even though LogCA does not accuratelycaptures binary search behavior it still provides an upper bound onthe achievable performance

    54 Case StudiesFigure 10 shows the evolution of crypto accelerators in SPARCarchitectures from the off-chip accelerators in pre-Niagara (Figure 10(a)) to accelerators integrated within the pipeline in SPARC T4(Figure 10 (e)) We observe that latency is absent in the on-chipacceleratorsrsquo optimization regions as these accelerators are closelycoupled with the host We also note that the optimization regionwith overheadsmdashrepresenting the complexity of an acceleratorrsquosinterfacemdashshrinks while the optimization regions with accelerationexpand from Figure 10 (a-e) For example for the off-chip cryptoaccelerator the cut-off granularity for overheads is 256KB whereasit is 128B for the SPARC T4 suggesting a much simpler interface

    Figure 10 (a) shows the optimization regions for the off-chipcrypto accelerator connected through the PCIe bus We note thatoverheads and latencies occupy most of the optimization regionsindicating high overhead OS calls and high-latency data copyingover the PCIe bus as the bottlenecks

    Figure 10 (b) shows the optimization regions for UltraSPARCT2 The large cut-off granularity for overheads at 32KB suggestsa complex interface indicating high overhead OS call creating abottleneck at small granularities The cut-off granularity of 2KB foracceleration suggests that optimizing acceleration is beneficial atlarge granularities

    Figure 10 (d) shows optimization regions for on-chip acceleratoron SPARC T4 There are three optimization regions with the cut-offgranularity for overhead now reduced to only 512B This observationsuggests a considerable improvement in the interface design overSPARC T3 and it is also evident by a smaller g1 We also note thatcut-off granularity for acceleration now decreases to 32B showingan increase in the opportunity for optimizing acceleration

    Figure 10 (e) shows optimization regions for crypto instructionson SPARC T4 We observe that unlike earlier designs it has only twooptimization regions and the speedup approaches the peak accelera-tion at a small granularity of 128B In contrast UltraSPARC T2 andSPARC T3 do not even provide any gains at this granularity We alsoobserve that the cut-off granularity for overheads further reduces to128B suggesting some opportunity for optimization at very smallgranularities The model also shows that the acceleration occupiesthe maximum range for optimization For example optimizing accel-eration provides benefits for granularities greater than 16B The lowoverhead access which LogCA shows is due to the non-privilegedinstruction SPARC T4 uses to access the cryptographic unit whichis integrated within the pipeline

    Figure 11 shows the evolution of memory interface design inGPU architectures It shows the optimization regions for matrixmultiplication on a discrete NVIDIA GPU an AMD integrated GPU(APU) and an integrated AMD GPU with HSA support We observethat matrix multiplication for all three architectures is compute bound

    LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

    16 128 1K 8K 64

    K51

    2K 4M 32M

    01

    1

    10

    100

    1000

    A

    CL

    oCLoC

    LC

    oL

    Granularity (Bytes)

    Sp

    eed

    up

    (a) PCIe Crypto Accelerator

    16 128 g1 1K 8K 64

    K51

    2K 4M 32M

    01

    1

    10

    100

    1000

    A

    CL

    oC AoCA

    oA

    Granularity (Bytes)

    (b) UltraSPARC T2

    16 128 g1 1K 8K 64

    K51

    2K 4M 32M

    01

    1

    10

    100

    1000CL

    A

    oC AoCA

    oA

    Granularity (Bytes)

    (c) SPARC T3

    g112

    8 1K 8K 64K

    512K 4M 32

    M

    01

    1

    10

    100

    1000

    A

    oCA A

    CL

    oA

    Granularity (Bytes)

    Sp

    eed

    up

    (d) SPARC T4 engine

    16 128 1K 8K 64

    K51

    2K 4M 32M

    01

    1

    10

    100

    1000

    A

    oCA A

    CL

    oA

    Granularity (Bytes)

    (e) SPARC T4 instruction

    LogCA L110xo110x C10x A10x

    Figure 10 LogCA for performing Advanced Encryption Standard on various crypto accelerators LogCA identifies the design bottle-necks through LogCA parameters in an optimization region The bottlenecks which LogCA suggests in each design is optimized inthe next design

    16 128 1K 8K g1

    64K

    512K 4M 32

    M

    01

    1

    10

    100

    1000

    A

    LoC LCALC A

    ALC

    o

    Granularity (Bytes)

    Sp

    eed

    up

    (a) NVIDIA Discrete GPU

    16 128 1K 8K g1

    64K

    512K 4M 32

    M

    01

    1

    10

    100

    1000

    A

    LoCLo

    CA

    ACA

    AoL

    C

    Granularity (Bytes)

    (b) AMD Integrated GPU (APU)

    16 128 g11K 8K 64

    K51

    2K 4M 32M

    01

    1

    10

    100

    1000

    A

    oC o

    CA

    CA

    o

    AC

    Granularity (Bytes)

    (c) HSA supported AMD Integrated GPU

    LogCA L110xo110x C10x A10x

    Figure 11 Various Optimization regions for matrix multiplication over a range of granularities on (a) NVIDIA discrete GPU (b)AMD APU and (c) HSA Supported GPU

    (sect31) We also observe that the computational index occupies mostof the regions which signifies maximum optimization potential

    The discrete GPU has four optimization regions (Figure 11 (a))Among these latency dominates most of the regions signifyinghigh-latency data copying over the PCIe bus and thus maximumoptimization potential The high cut-off granularity for overheads at

    32KB indicates high overhead OS calls to access the GPU Similarlywith highly aggressive cores acceleration has high cut-off granular-ity of 256KB indicating less optimization potential for acceleration

    Similar to the discrete GPU the APU also has four optimiza-tion regions (Figure 11 (b)) There are few notable differences ascompared to the discrete GPU The cut-off granularity for latency

    ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

    reduces to 512KB with the elimination of data copying over thePCIe bus the overheads are still high suggesting high overhead OScalls to access the APU with less aggressive cores the cut-off granu-larity for acceleration reduces to 64KB implying more optimizationpotential for acceleration

    Figure 11 (c) shows three optimization regions for the HSA en-abled integrated GPU We observe that latency is absent in all regionsand the cut-off granularity for overhead reduces to 8KB These re-ductions in overheads and latencies signify a simpler interface ascompared to the discrete GPU and APU We also observe that thecut-off granularity for acceleration drops to 2KB suggesting higherpotential for optimizing acceleration

    6 RELATED WORKWe compare and contrast our work with prior approaches Lopez-Novoa et al [28] provide a detailed survey of various acceleratormodeling techniques We broadly classify these techniques in twocategories and discuss the most relevant work

    Analytical Models There is a rich body of work exploring ana-lytical models for performance prediction of accelerators For somemodels the motivation is to determine the future trend in heteroge-neous architectures Chung et al [7] in a detailed study predict thefuture landscape of heterogeneous computing Hempstead et al [17]propose an early-stage model Navigo that determines the fraction ofarea required for accelerators to maintain the traditional performancetrend Nilakantan et al [32] propose to incorporate communicationcost for early-stage model of accelerator-rich architectures For oth-ers the motivation is to determine the right amount of data to offloadand the potential benefits associated with an accelerator [24]

    Some analytical models are architecture specific For examplea number of studies [20 21 44 57] predict performance of GPUarchitectures Hong et al [20] present an analytical performancemodel for predicting execution time on GPUs They later extendtheir model and develop an integrated power and performance modelfor the GPUs [21] Song et al [44] use a simple counter basedapproach to predict power and performance Meswani et al [29]explore such models for high performance applications Daga etal [11] analyze the effectiveness of Accelerated processing units(APU) over GPUs and describe the communication cost over thePCIe bus as a major bottleneck in exploiting the full potential ofGPUs

    In general our work is different from these studies because ofthe complexity These models use a large number of parameters toaccurately predict the power andor performance whereas we limitthe number of parameters to reduce the complexity of our modelThey also require deep understanding of the underlying architectureMost of these models also require access to GPU specific assemblyor PTX codes Unlike these approaches we use CPU code to providebounds on the performance

    Roofline Models In terms of simplicity and motivation our workclosely matches the Roofline model [54]mdasha visual performancemodel for multi-core architectures Roofline exposes bottlenecks fora kernel and suggests several optimizations which programmers canuse to fine tune the kernel on a given system

    A number of extensions of Roofline have been proposed [1022 33 56] and some of these extensions are architecture specific

    For example targeting GPUs [22] vector processors [39] and FP-GAs [10 56]

    Despite the similarities roofline and its extensions cannot be usedfor exposing design bottlenecks in an acceleratorrsquos interface Theprimary goal of roofline models has been to help programmers andcompiler writer while LogCA provides more insights for architects

    7 CONCLUSION AND FUTURE WORKWith the recent trend towards heterogeneous computing we feelthat the architecture community lacks a model to reason about theneed of accelerators In this respect we propose LogCAmdashan insight-ful visual performance model for hardware accelerators LogCAprovides insights early in the design stage to both architects andprogrammers and identifies performance bounds exposes interfacedesign bottlenecks and suggest optimizations to alleviate these bot-tlenecks We have validated our model across a range of on-chip andoff-chip accelerators and have shown its utility using retrospectivestudies describing the evolution of acceleratorrsquos interface in thesearchitectures

    The applicability of LogCA can be limited by our simplifying as-sumptions and for more realistic analysis we plan to overcome theselimitations in our future work For example We also assume a singleaccelerator system and do not explicitly model contention amongresources Our model should handle multi-accelerator and pipelinedscenarios For fixed function accelerators our design space is cur-rently limited to encryption and hashing kernels To overcome thiswe are expanding our design space with compression and databaseaccelerators in Oracle M7 processor We also plan to complementLogCA with an energy model as energy efficiency is a prime designmetric for accelerators

    ACKNOWLEDGEMENTSWe thank our anonymous reviewers Arkaprava Basu and TonyNowatzki for their insightful comments and feedback on the paperThanks to Mark Hill Michael Swift Wisconsin Computer Archi-tecture Affiliates and other members of the Multifacet group fortheir valuable discussions We also thank Brian Wilson at Universityof Wisconsin DoIT and Eric Sedlar at Oracle Labs for providingaccess to SPARC T3 and T4 servers respectively Also thanks toMuhammad Umair Bin Altaf for his help in the formulation Thiswork is supported in part by the National Science Foundation (CNS-1302260 CCF-1438992 CCF-1533885 CCF- 1617824) Googleand the University of Wisconsin-Madison (Amar and Balindar SohiProfessorship in Computer Science) Wood has a significant financialinterest in AMD and Google

    REFERENCES[1] Advanced Micro Devices 2016 APP SDK - A Complete Development Platform

    Advanced Micro Devices httpdeveloperamdcomtools-and-sdksopencl-zoneamd-accelerated-parallel-processing-app-sdk

    [2] Gene M Amdahl 1967 Validity of the single processor approach to achievinglarge scale computing capabilities Proceedings of the April 18-20 1967 springjoint computer conference on - AFIPS rsquo67 (Spring) (1967) 483 httpsdoiorg10114514654821465560

    [3] Dan Anderson 2012 How to tell if SPARC T4 crypto is being used httpsblogsoraclecomDanXentryhow_to_tell_if_sparc

    [4] Krste Asanovic Rastislav Bodik James Demmel Tony Keaveny Kurt KeutzerJohn Kubiatowicz Nelson Morgan David Patterson Koushik Sen John

    LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

    Wawrzynek David Wessel and Katherine Yelick 2009 A View of the Par-allel Computing Landscape Commun ACM 52 10 (oct 2009) 56ndash67 httpsdoiorg10114515627641562783

    [5] Nathan Beckmann and Daniel Sanchez 2016 Cache Calculus Modeling Cachesthrough Differential Equations Computer Architecture Letters PP 99 (2016)1 httpieeexploreieeeorglpdocsepic03wrapperhtmarnumber=7366753$delimiter026E30F$npapers3publicationdoi101109LCA20152512873

    [6] C Cascaval S Chatterjee H Franke K J Gildea and P Pattnaik 2010 Ataxonomy of accelerator architectures and their programming models IBMJournal of Research and Development 54 (2010) 51ndash510 httpsdoiorg101147JRD20102059721

    [7] Eric S Chung Peter a Milder James C Hoe and Ken Mai 2010 Single-ChipHeterogeneous Computing Does the Future Include Custom Logic FPGAs andGPGPUs 2010 43rd Annual IEEEACM International Symposium on Microar-chitecture (dec 2010) 225ndash236 httpsdoiorg101109MICRO201036

    [8] Jason Cong Zhenman Fang Michael Gill and Glenn Reinman 2015 PARADEA Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Archi-tectural Design and Exploration In 2015 IEEEACM International Conference onComputer-Aided Design Austin TX

    [9] D Culler R Karp D Patterson and A Sahay 1993 LogP Towards a realisticmodel of parallel computation In Proceedings of the Fourth ACM SIGPLANSymposium on Principles and Practice of Parallel Programming 1ndash12 httpdlacmorgcitationcfmid=155333

    [10] Bruno da Silva An Braeken Erik H DrsquoHollander and Abdellah Touhafi 2013Performance Modeling for FPGAs Extending the Roofline Model with High-level Synthesis Tools Int J Reconfig Comput 2013 (jan 2013) 77mdash-77httpsdoiorg1011552013428078

    [11] Mayank Daga Ashwin M Aji and Wu-chun Feng 2011 On the Efficacy of aFused CPU+GPU Processor (or APU) for Parallel Computing In 2011 Symposiumon Application Accelerators in High-Performance Computing Ieee 141ndash149httpsdoiorg101109SAAHPC201129

    [12] Hadi Esmaeilzadeh Emily Blem Renee St Amant Karthikeyan Sankaralingamand Doug Burger 2011 Dark silicon and the end of multicore scaling InProceeding of the 38th annual international symposium on Computer archi-tecture - ISCA rsquo11 ACM Press New York New York USA 365 httpsdoiorg10114520000642000108

    [13] H Franke J Xenidis C Basso B M Bass S S Woodward J D Brown andC L Johnson 2010 Introduction to the wire-speed processor and architectureIBM Journal of Research and Development 54 (2010) 31ndash311 httpsdoiorg101147JRD20092036980

    [14] Venkatraman Govindaraju Chen Han Ho and Karthikeyan Sankaralingam 2011Dynamically specialized datapaths for energy efficient computing In Proceedings- International Symposium on High-Performance Computer Architecture 503ndash514httpsdoiorg101109HPCA20115749755

    [15] Shay Gueron 2012 Intel Advanced Encryption Standard (AES) Instructions SetTechnical Report Intel Corporation httpssoftwareintelcomsitesdefaultfilesarticle165683aes-wp-2012-09-22-v01pdf

    [16] Rehan Hameed Wajahat Qadeer Megan Wachs Omid Azizi Alex SolomatnikovBenjamin C Lee Stephen Richardson Christos Kozyrakis and Mark Horowitz2010 Understanding sources of inefficiency in general-purpose chips Proceed-ings of the 37th annual international symposium on Computer architecture - ISCA

    rsquo10 (2010) 37 httpsdoiorg10114518159611815968[17] Mark Hempstead Gu-Yeon Wei and David Brooks 2009 Navigo An early-

    stage model to study power-constrained architectures and specialization In ISCAWorkshop on Modeling Benchmarking and Simulations (MoBS)

    [18] John L Hennessy and David A Patterson 2006 Computer Architecture FourthEdition A Quantitative Approach 704 pages httpsdoiorg10111151881

    [19] Mark D Hill and Michael R Marty 2008 Amdahlrsquos Law in the Multicore EraComputer 41 7 (jul 2008) 33ndash38 httpsdoiorg101109MC2008209

    [20] Sunpyo Hong and Hyesoon Kim 2009 An analytical model for a GPU architec-ture with memory-level and thread-level parallelism awareness In Proceedingsof the 36th Annual International Symposium on Computer Architecture Vol 37152ndash163 httpsdoiorg10114515558151555775

    [21] Sunpyo Hong and Hyesoon Kim 2010 An integrated GPU power and perfor-mance model In Proceedings of the 37th Annual International Symposium onComputer Architecture Vol 38 280mdash-289 httpsdoiorg10114518160381815998

    [22] Haipeng Jia Yunquan Zhang Guoping Long Jianliang Xu Shengen Yan andYan Li 2012 GPURoofline A Model for Guiding Performance Optimizations onGPUs In Proceedings of the 18th International Conference on Parallel Processing(Euro-Parrsquo12) Springer-Verlag Berlin Heidelberg 920ndash932 httpsdoiorg101007978-3-642-32820-6_90

    [23] Onur Kocberber Boris Grot Javier Picorel Babak Falsafi Kevin Lim andParthasarathy Ranganathan 2013 Meet the Walkers Accelerating Index Traver-sals for In-memory Databases In Proceedings of the 46th Annual IEEEACMInternational Symposium on Microarchitecture (MICRO-46) ACM New YorkNY USA 468ndash479 httpsdoiorg10114525407082540748

    [24] Karthik Kumar Jibang Liu Yung Hsiang Lu and Bharat Bhargava 2013 Asurvey of computation offloading for mobile systems Mobile Networks andApplications 18 (2013) 129ndash140 httpsdoiorg101007s11036-012-0368-0

    [25] Snehasish Kumar Naveen Vedula Arrvindh Shriraman and Vijayalakshmi Srini-vasan 2015 DASX Hardware Accelerator for Software Data Structures InProceedings of the 29th ACM on International Conference on Supercomputing(ICS rsquo15) ACM New York NY USA 361ndash372 httpsdoiorg10114527512052751231

    [26] Maysam Lavasani Hari Angepat and Derek Chiou 2014 An FPGA-based In-Line Accelerator for Memcached IEEE Comput Archit Lett 13 2 (jul 2014)57ndash60 httpsdoiorg101109L-CA201317

    [27] John D C Little and Stephen C Graves 2008 Littlersquos law In Building intuitionSpringer 81ndash100

    [28] U Lopez-Novoa A Mendiburu and J Miguel-Alonso 2015 A Survey of Perfor-mance Modeling and Simulation Techniques for Accelerator-Based ComputingParallel and Distributed Systems IEEE Transactions on 26 1 (jan 2015) 272ndash281httpsdoiorg101109TPDS20142308216

    [29] M R Meswani L Carrington D Unat A Snavely S Baden and S Poole2013 Modeling and predicting performance of high performance computingapplications on hardware accelerators International Journal of High Perfor-mance Computing Applications 27 (2013) 89ndash108 httpsdoiorg1011771094342012468180

    [30] National Institute of Standards and Technology 2001 Advanced EncryptionStandard (AES) National Institute of Standards and Technology httpsdoiorg106028NISTFIPS197

    [31] National Institute of Standards and Technology 2008 Secure Hash StandardNational Institute of Standards and Technology httpcsrcnistgovpublicationsfipsfips180-3fips180-3_finalpdf

    [32] S Nilakantan S Battle and M Hempstead 2013 Metrics for Early-Stage Model-ing of Many-Accelerator Architectures Computer Architecture Letters 12 1 (jan2013) 25ndash28 httpsdoiorg101109L-CA20129

    [33] Cedric Nugteren and Henk Corporaal 2012 The Boat Hull Model EnablingPerformance Prediction for Parallel Computing Prior to Code Development Cate-gories and Subject Descriptors In Proceedings of the 9th Conference on Comput-ing Frontiers ACM 203mdash-212

    [34] OpenSSL Software Foundation 2015 OpenSSL Cryptography and SSLTLSToolkit OpenSSL Software Foundation httpsopensslorg

    [35] Sanjay Patel 2009 Sunrsquos Next-Generation Multithreaded Processor RainbowFalls In 21st Hot Chip Symposium httpwwwhotchipsorgwp-contentuploadshc

    [36] Sanjay Patel and Wen-mei W Hwu 2008 Accelerator Architectures IEEE Micro28 4 (jul 2008) 4ndash12 httpsdoiorg101109MM200850

    [37] Stephen Phillips 2014 M7 Next Generation SPARC In 26th Hot Chip Sympo-sium

    [38] Phil Rogers 2013 Heterogeneous system architecture overview In Hot Chips[39] Yoshiei Sato Ryuichi Nagaoka Akihiro Musa Ryusuke Egawa Hiroyuki Tak-

    izawa Koki Okabe and Hiroaki Kobayashi 2009 Performance tuning andanalysis of future vector processors based on the roofline model Proceedingsof the 10th MEDEA workshop on MEmory performance DEaling with Applica-tions systems and architecture - MEDEA rsquo09 (2009) 7 httpsdoiorg10114516219601621962

    [40] M Shah J Barren J Brooks R Golla G Grohoski N Gura R Hetherington PJordan M Luttrell C Olson B Sana D Sheahan L Spracklen and A Wynn2007 UltraSPARC T2 A highly-treaded power-efficient SPARC SOC InSolid-State Circuits Conference 2007 ASSCC rsquo07 IEEE Asian 22ndash25 httpsdoiorg101109ASSCC20074425786

    [41] Manish Shah Robert Golla Gregory Grohoski Paul Jordan Jama Barreh JeffreyBrooks Mark Greenberg Gideon Levinsky Mark Luttrell Christopher OlsonZeid Samoail Matt Smittle and Thomas Ziaja 2012 Sparc T4 A dynamicallythreaded server-on-a-chip IEEE Micro 32 (2012) 8ndash19 httpsdoiorg101109MM20121

    [42] Yakun Sophia Shao Brandon Reagen Gu-Yeon Wei and David Brooks 2014Aladdin A Pre-RTL Power-Performance Accelerator Simulator Enabling LargeDesign Space Exploration of Customized Architectures In International Sympo-sium on Computer Architecture (ISCA)

    [43] Soekris Engineering 2016 vpn 1401 for Std PCI-sockets Soekris Engineeringhttpsoekriscomproductsvpn-1401html

    [44] Shuaiwen Song Chunyi Su Barry Rountree and Kirk W Cameron 2013 Asimplified and accurate model of power-performance efficiency on emergent GPUarchitectures In Proceedings - IEEE 27th International Parallel and DistributedProcessing Symposium IPDPS 2013 673ndash686 httpsdoiorg101109IPDPS201373

    [45] Jeff Stuecheli 2013 POWER8 In 25th Hot Chip Symposium[46] Ning Sun and Chi-Chang Lin 2007 Using the Cryptographic Accelerators in the

    UltraSPARC T1 and T2 processors Technical Report httpwwworaclecomtechnetworkserver-storagesolarisdocumentation819-5782-150147pdf

    ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

    [47] S Tabik G Ortega and E M Garzoacuten 2014 Performance evaluation of ker-nel fusion BLAS routines on the GPU iterative solvers as case study TheJournal of Supercomputing 70 2 (nov 2014) 577ndash587 httpsdoiorg101007s11227-014-1102-4

    [48] Y C Tay 2013 Analytical Performance Modeling for Computer Systems (2nded) Morgan amp Claypool Publishers

    [49] MB Taylor 2012 Is dark silicon useful harnessing the four horsemen of thecoming dark silicon apocalypse In Design Automation Conference (DAC) 201249th ACMEDACIEEE 1131ndash1136 httpsdoiorg10114522283602228567

    [50] G Venkatesh J Sampson N Goulding S Garcia V Bryksin J Lugo-MartinezS Swanson and M B Taylor 2010 Conservation cores Reducing the energyof mature computations In International Conference on Architectural Supportfor Programming Languages and Operating Systems - ASPLOS 205ndash218 httpsdoiorg10114517360201736044

    [51] Ganesh Venkatesh Jack Sampson Nathan Goulding-Hotta Sravanthi KotaVenkata Michael Bedford Taylor and Steven Swanson 2011 QsCores Trad-ing Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores InProceedings of the 44th Annual IEEEACM International Symposium on Microar-chitecture - MICRO-44 rsquo11 163 httpsdoiorg10114521556202155640

    [52] Guibin Wang Yisong Lin and Wei Yi 2010 Kernel Fusion An EffectiveMethod for Better Power Efficiency on Multithreaded GPU In Green Computingand Communications (GreenCom) 2010 IEEEACM Intrsquol Conference on CyberPhysical and Social Computing (CPSCom) 344ndash350 httpsdoiorg101109GreenCom-CPSCom2010102

    [53] Eric W Weisstein 2015 Newtonrsquos Method From MathWorld ndash A Wolfram WebResource httpmathworldwolframcomNewtonsMethodhtml

    [54] Samuel Williams Andrew Waterman and David Patterson 2009 Roofline aninsightful visual performance model for multicore architectures Commun ACM52 (2009) 65ndash76 httpsdoiorg10114514987651498785

    [55] Lisa Wu Andrea Lottarini Timothy K Paine Martha A Kim and Kenneth ARoss 2014 Q100 The Architecture and Design of a Database Processing UnitIn Proceedings of the 19th International Conference on Architectural Supportfor Programming Languages and Operating Systems (ASPLOS rsquo14) ACM NewYork NY USA 255ndash268 httpsdoiorg10114525419402541961

    [56] Moein Pahlavan Yali 2014 FPGA-Roofline An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems PhD DissertationVirginia Polytechnic Institute and State University

    [57] Yao Zhang and John D Owens 2011 A quantitative performance analysismodel for GPU architectures In Proceedings - International Symposium on High-Performance Computer Architecture 382ndash393 httpsdoiorg101109HPCA20115749745

    • Abstract
    • 1 Introduction
    • 2 The LogCA Model
      • 21 Effect of Granularity
      • 22 Performance Metrics
      • 23 Granularity dependent latency
        • 3 Applications of LogCA
          • 31 Performance Bounds
          • 32 Sensitivity Analysis
            • 4 Experimental Methodology
            • 5 Evaluation
              • 51 Linear-Complexity Kernels (= 1)
              • 52 Super-Linear Complexity Kernels (gt 1)
              • 53 Sub-Linear Complexity Kernels (lt 1)
              • 54 Case Studies
                • 6 Related Work
                • 7 Conclusion and Future Work
                • References

      LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

      Table 1 Description of the LogCA parameters

      Parameter Symbol Description Units

      Latency L Cycles to move data from the host to the accelerator across the interface including the cycles dataspends in the caches or memory

      Cycles

      Overhead o Cycles the host spends in setting up the algorithm Cycles

      Granularity g Size of the offloaded data Bytes

      Computational Index C Cycles the host spends per byte of data CyclesByte

      Acceleration A The peak speedup of an accelerator NA

      Host Accelerator

      Interface

      time

      Co(g)

      o1(g) L1(g)C1(g) =

      Co(g)A

      Gain

      T0(g)

      T1(g)

      (a) (b)

      Figure 2 Top level description of the LogCA model (a) Showsthe various components (b) Time-line for the computation per-formed on the host system (above) and on an accelerator (be-low)

      the algorithmrsquos execution time is a function of granularity ie thesize of the offloaded data With this assumption the unacceleratedtime T0 (time with zero accelerators) to process data of granularityg will be T0 (g) =C0 (g) where C0 (g) is the computation time on thehost

      When the data is offloaded to an accelerator the new executiontime T1 (time with one accelerator) is T1 (g) =O1 (g)+L1 (g)+C1 (g)where O1 (g) is the host overhead time in offloading lsquogrsquo bytes ofdata to the accelerator L1 (g) is the interface latency and C1 (g) is thecomputation time in the accelerator to process data of granularity g

      To make our model more concrete we make several assumptionsWe assume that an accelerator with acceleration lsquoArsquo can decreasein the absence of overheads the algorithmrsquos computation time onthe host by a factor of lsquoArsquo ie the accelerator and host use algo-rithms with the same complexity Thus the computation time on theaccelerator will be C1 (g) =

      C0 (g)A This reduction in the computation

      time results in performance gains and we quantify these gains withspeedup the ratio of the un-accelerated and accelerated time

      Speedup(g) =T0 (g)T1 (g)

      =C0 (g)

      O1 (g)+L1 (g)+C1 (g)(1)

      We assume that the computation time is a function of the com-putational index lsquoCrsquo and granularity ie C0 (g) =C lowast f (g) wheref (g) signifies the complexity of the algorithm We also assume thatf (g) is power function of rsquogrsquo ie O (gβ ) This assumption resultsin a simple closed-form model and bounds the performance for amajority of the prevalent algorithms in the high-performance comput-ing community [4] ranging from sub-linear (β lt 1) to super-linear(β gt 1) complexities However this assumption may not work wellfor logarithmic complexity algorithms ie O (log(g))O (g log(g))This is because asymptotically there is no function which grows

      slower than a logarithmic function Despite this limitation we ob-serve thatmdashin the granularity range of our interestmdashLogCA can alsobound the performance for logarithmic functions (sect5)

      For many algorithms and accelerators the overhead is indepen-dent of the granularity ie O1 (g) = o Latency on the other handwill often be granularity dependent ie L1 (g) = Llowastg Latency maybe granularity independent if the accelerator can begin operatingwhen the first byte (or block) arrives at the accelerator ie L1 (g) = LThus LogCA can also model pipelined interfaces using granularityindependent latency assumption

      We define computational intensity1 as the ratio of computationalindex to latency ie C

      L and it signifies the amount of work done ona host per byte of offloaded data Similarly we define acceleratorrsquoscomputational intensity as the ratio of computational intensity toacceleration ie CA

      L and it signifies the amount of work done onan accelerator per byte of offloaded data

      For simplicity we begin with the assumption of granularity in-dependent latency We revisit granularity dependent latencies later(sect 23) With these assumptions

      Speedup(g) =C lowast f (g)

      o+L+ Clowast f (g)A

      =C lowastgβ

      o+L+ Clowastgβ

      A

      (2)

      The above equation shows that the speedup is dependent on LogCAparameters and these parameters can be changed by architects andprogrammers through algorithmic and design choices An architectcan reduce the latency by integrating an accelerator more closelywith the host For example placing it on the processor die ratherthan on an IO bus An architect can also reduce the overheads bydesigning a simpler interface ie limited OS intervention and ad-dress translations lower initialization time and reduced data copyingbetween buffers (memories) etc A programmer can increase thecomputational index by increasing the amount of work per byteoffloaded to an accelerator For example kernel fusion [47 52]mdashwhere multiple computational kernels are fused into onemdashtends toincrease the computational index Finally an architect can typicallyincrease the acceleration by investing more chip resources or powerto an accelerator

      21 Effect of GranularityA key aspect of LogCA is that it captures the effect of granularity onthe acceleratorrsquos speedup Figure 3 shows this behavior ie speedupincreases with granularity and is bounded by the acceleration lsquoArsquo At

      1not to be confused with operational intensity [54] which signifies operations performedper byte of DRAM traffic

      ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

      g1 gA2

      1

      A2

      A

      Granularity (Bytes)

      Sp

      eed

      up

      (g)

      Figure 3 A graphical description of the performance metrics

      one extreme for large granularities equation (2) becomes

      limgrarrinfin

      Speedup(g) = A (3)

      While for small granularities equation (2) reduces to

      limgrarr0

      Speedup(g) ≃ Co+L+ C

      Alt

      Co+L

      (4)

      Equation (4) is simply Amdahlrsquos Law [2] for accelerators demon-strating the dominating effect of overheads at small granularities

      22 Performance MetricsTo help programmers decide when and how much computation tooffload we define two performance metrics These metrics are in-spired by the vector machine metrics Nv and N12[18] where Nvis the vector length to make vector mode faster than scalar modeand N12 is the vector length to achieve half of the peak perfor-mance Since vector length is an important parameter in determiningperformance gains for vector machines these metrics characterizethe behavior and efficiency of vector machines with reference toscalar machines Our metrics tend to serve the same purpose in theaccelerator domain

      g1 The granularity to achieve a speedup of 1 (Figure 3) It isthe break-even point where the acceleratorrsquos performance becomesequal to the host Thus it is the minimum granularity at which anaccelerator starts providing benefits Solving equation (2) for g1gives

      g1 =

      [(A

      Aminus1

      )lowast(

      o+LC

      )] 1β

      (5)

      IMPLICATION 1 g1 is essentially independent of accelerationfor large values of lsquoArsquo

      For reducing g1 the above implication guides an architect toinvest resources in improving the interface

      IMPLICATION 2 Doubling computational index reduces g1 by

      2minus1β

      The above implication demonstrates the effect of algorithmiccomplexity on g1 and shows that varying computational index has aprofound effect on g1 for sub-linear algorithms For example for asub-linear algorithm with β = 05 doubling the computational indexdecreases g1 by a factor of four However for linear (β = 1) andquadratic (β = 2) algorithms g1 decreases by factors of two and

      radic2

      respectively

      g A2 The granularity to achieve a speedup of half of the acceler-

      ation This metric provides information about a systemrsquos behaviorafter the break-even point and shows how quickly the speedup canramp towards acceleration Solving equation (2) for g A

      2gives

      g A2=

      [Alowast(

      o+LC

      )] 1β

      (6)

      Using equation (5) and (6) g1 and g A2are related as

      g A2= (Aminus1)

      1β lowastg1 (7)

      IMPLICATION 3 Doubling acceleration lsquoArsquo increases the gran-

      ularity to attain A2 by 2

      The above implication demonstrates the effect of accelerationon g A

      2and shows that this effect is more pronounced for sub-linear

      algorithms For example for a sub-linear algorithm with β = 05doubling acceleration increases g A

      2by a factor of four However for

      linear and quadratic algorithms g A2increases by factors of two and

      radic2 respectivelyFor architects equation (7) also exposes an interesting design

      trade-off between acceleration and performance metrics Typicallyan architect may prefer higher acceleration and lower g1 g A

      2 How-

      ever equation (7) shows that increasing acceleration also increasesg A

      2 This presents a dilemma for an architect to favor either higher

      acceleration or reduced granularity especially for sub-linear algo-rithms LogCA helps by exposing these trade-offs at an early designstage

      In our model we also use g1 to determine the complexity of thesystemrsquos interface A lower g1 (on the left side of plot in Figure 3)is desirable as it implies a system with lower overheads and thus asimpler interface Likewise g1 increases with the complexity of theinterface or when an accelerator moves further away from the host

      23 Granularity dependent latencyThe previous section assumed latency is granularity independent butwe have observed granularity dependent latencies in GPUs In thissection we discuss the effect of granularity on speedup and deriveperformance metrics assuming granularity dependent-latency

      Assuming granularity dependent latency equation (1) reduces to

      Speedup(g) =C lowastgβ

      o+Llowastg+ Clowastgβ

      A

      (8)

      For large granularities equation (8) reduces to

      limgrarrinfin

      Speedup(g) =

      (A

      AClowastgβ

      lowast (Llowastg)+1

      )lt

      CLlowastgβminus1 (9)

      Unlike equation (3) speedup in the above equation approachesCL lowastgβminus1 at large granularities Thus for linear algorithms with gran-ularity dependent latency instead of acceleration speedup is limitedby C

      L However for super-linear algorithms this limit increases by afactor of gβminus1 whereas for sub-linear algorithms this limit decreasesby a factor of gβminus1

      LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

      IMPLICATION 4 With granularity dependent latency the speedupfor sub-linear algorithms asymptotically decreases with the increasein granularity

      The above implication suggests that for sub-linear algorithms onsystems with granularity dependent latency speedup may decreasefor some large granularities This happens because for large granu-larities the communication latency (a linear function of granularity)may be higher than the computation time (a sub-linear function ofgranularity) on the accelerator resulting in a net de-accelerationThis implication is surprising as earlier we observed thatmdashfor sys-tems with granularity independent latencymdashspeedup for all algo-rithms increase with granularity and approaches acceleration forvery large granularities

      For very small granularities equation (8) reduces to

      limgrarr 0

      Speedup(g) ≃ Alowast CAlowast (o+L)+C

      (10)

      Similar to equation (4) the above equation exposes the increasingeffects of overheads at small granularities Solving equation (8) forg1 using Newtonrsquos method [53]

      g1 =C lowast (β minus1) lowast (Aminus1)+Alowasto

      C lowastβ lowast (Aminus1)minusAlowastL(11)

      For a positive value of g1 equation (11) must satisfy CL gt 1

      β

      Thus for achieving any speedup for linear algorithms CL should

      be at least 1 However for super-linear algorithms a speedup of 1can achieved at values of C

      L smaller than 1 whereas for sub-linearalgorithms algorithms C

      L must be greater than 1

      IMPLICATION 5 With granularity dependent latency computa-tional intensity for sub-linear algorithms should be greater than 1to achieve any gains

      Thus for sub-linear algorithms computational index has to begreater than latency to justify offloading the work However forhigher-complexity algorithms computational index can be quitesmall and still be potentially useful to offload

      Similarly solving equation (8) using Newtonrsquos method for g A2

      gives

      g A2=

      C lowast (β minus1)+AlowastoC lowastβ minusAlowastL

      (12)

      For a positive value of g A2 equation (12) must satisfy CA

      L gt 1β

      Thus for achieving a speedup of A2 CL should be at least lsquoArsquo for

      linear algorithms However for super-linear algorithms a speedupof A

      2 can achieved at values of CL smaller than lsquoArsquo whereas for

      sub-linear algorithms CL must be greater than lsquoArsquo

      IMPLICATION 6 With granularity dependent latency accelera-torrsquos computational intensity for sub-linear algorithms should begreater than 1 to achieve speedup of half of the acceleration

      The above implication suggests that for achieving half of theacceleration with sub-linear algorithms the computation time on theaccelerator must be greater than latency However for super-linearalgorithms that speedup can be achieved even if the computationtime on accelerator is lower than latency Programmers can usethe above implications to determinemdashearly in the design cyclemdashwhether to put time and effort in porting a code to an accelerator

      g1

      1

      A

      CL

      limgrarrinfin Speedup(g) = A

      CL gt A

      Sp

      eed

      up

      g1

      1

      CL

      A

      limgrarrinfin Speedup(g) = A

      CL lt A

      g1

      1

      CL

      A

      limgrarrinfin Speedup(g) = CL

      CL lt A

      Granularity (Bytes)

      Sp

      eed

      up

      g1

      1

      CL

      A

      limgrarrinfin Speedup(g) lt CL

      CL lt A

      Granularity (Bytes)

      (a) Performance bounds for compute-bound kernels

      (b) Performance bounds for latency-bound kernels

      Figure 4 LogCA helps in visually identifying (a) compute and(b) latency bound kernels

      For example consider a system with a minimum desirable speedupof one half of the acceleration but has a computational intensity ofless than the acceleration With the above implication architectsand programmers can infer early in the design stage that the desiredspeedup can not be achieved for sub-linear and linear algorithmsHowever the desired speedup can be achieved with super-linearalgorithms

      We are also interested in quantifying the limits on achievablespeedup due to overheads and latencies To do this we assume ahypothetical accelerator with infinite acceleration and calculate thegranularity (gA) to achieve the peak speedup of lsquoArsquo With this as-sumption the desired speedup of lsquoArsquo is only limited by the overheadsand latencies Solving equation (8) for gA gives

      gA =C lowast (β minus1)+Alowasto

      C lowastβ minusAlowastL(13)

      Surprisingly we find that the above equation is similar to equa-tion (12) ie gA equals g A

      2 This observation shows that with a

      hypothetical accelerator the peak speedup can now be achieved atthe same granularity as g A

      2 This observation also demonstrates that

      if g A2is not achievable on a system ie CA

      L lt 1β

      as per equation(12) then despite increasing the acceleration gA will not be achiev-able and the speedup will still be bounded by the computationalintensity

      IMPLICATION 7 If a speedup of A2 is not achievable on an ac-

      celerator with acceleration lsquoArsquo despite increasing acceleration toAtilde (where Atilde gt A) the speedup is bounded by the computationalintensity

      The above implication helps architects in allocating more re-sources for an efficient interface instead of increasing acceleration

      ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

      16 128 1K 8K 64

      K51

      2K 4M 32M

      01

      1

      10

      100

      1000

      A

      No variation

      Granularity (Bytes)

      Sp

      eed

      up

      (a) Latency

      LogCAL110x

      16 128 1K 8K 64

      K51

      2K 4M 32M

      01

      1

      10

      100

      1000

      A

      Granularity (Bytes)

      (b) Overheads

      LogCAo110x

      16 128 1K 8K 64

      K51

      2K 4M 32M

      01

      1

      10

      100

      1000

      A

      Granularity (Bytes)

      (c) Computational Index

      LogCAC10x

      16 128 1K 8K 64

      K51

      2K 4M 32M

      01

      1

      10

      100

      1000

      A

      CL

      Granularity (Bytes)

      (d) Acceleration

      LogCAA10x

      Figure 5 The effect on speedup of 10x improvement in each LogCA parameter The base case is the speedup of AES [30] on Ultra-SPARC T2

      3 APPLICATIONS OF LogCAIn this section we describe the utility of LogCA for visually iden-tifying the performance bounds design bottlenecks and possibleoptimizations to alleviate these bottlenecks

      31 Performance BoundsEarlier we have observed that the speedup is bounded by eitheracceleration (equation 3) or the product of computational intensityand gβminus1 (equation 9) Using these observations we classify kernelseither as compute-bound or latency-bound For compute-bound ker-nels the achievable speedup is bounded by acceleration whereas forthe latency-bound kernels the speedup is bounded by computationalintensity Based on this classification a compute-bound kernel caneither be running on a system with granularity independent latencyor has super-linear complexity while running on a system with gran-ularity dependent latency Figure 4-a illustrates these bounds forcompute-bound kernels On the other hand a latency-bound kernelis running on a system with granularity dependent latency and haseither linear or sub-linear complexity Figure 4-b illustrates thesebounds for latency-bound kernels

      Programmers and architects can visually identify these boundsand use this information to invest their time and resources in the rightdirection For example for compute-bound kernelsmdashdependingon the operating granularitymdashit may be beneficial to invest moreresources in either increasing acceleration or reducing overheadsHowever for latency-bound kernels optimizing acceleration andoverheads is not that critical but decreasing latency and increasingcomputational index maybe more beneficial

      32 Sensitivity AnalysisTo identify the design bottlenecks we perform a sensitivity analysisof the LogCA parameters We consider a parameter a design bottle-neck if a 10x improvement in it provides at lest 20 improvement inspeedup A lsquobottleneckedrsquo parameter also provides an optimizationopportunity To visually identify these bottlenecks we introduceoptimization regions As an example we identify design bottlenecksin UltraSPARC T2rsquos crypto accelerator by varying its individualparameters 2 in Figure 5 (a)-(d)

      2We elaborate our methodology for measuring LogCA parameters later (sect 4)

      Figure 5 (a) shows the variation (or the lack of) in speedup withthe decrease in latency The resulting gains are negligible and inde-pendent of the granularity as it is a closely coupled accelerator

      Figure 5 (b) shows the resulting speedup after reducing overheadsSince the overheads are one-time initialization cost and independentof granularity the per byte setup cost is high at small granularitiesDecreasing these overheads considerably reduces the per byte setupcost and results in significant gains at these smaller granularitiesConversely for larger granularities the per byte setup cost is alreadyamortized so reducing overheads does not provide much gainsThus overhead is a bottleneck at small granularities and provide anopportunity for optimization

      Figure 5 (c) shows the effect of increasing the computationalindex The results are similar to optimizing overheads in Figure 5 (b)ie significant gains for small granularities and a gradual decreasein the gains with increasing granularity With the constant overheadsincreasing computational index increases the computation time of thekernel and decreases the per byte setup cost For smaller granularitiesthe reduced per byte setup cost results in significant gains

      Figure 5 (d) shows the variation in speedup with increasing peakacceleration The gains are negligible at small granularities andbecome significant for large granularities As mentioned earlierthe per byte setup cost is high at small granularities and it reducesfor large granularities Since increasing peak acceleration does notreduce the per byte setup cost optimizing peak acceleration providesgains only at large granularities

      We group these individual sensitivity plots in Figure 6 to buildthe optimization regions As mentioned earlier each region indicatesthe potential of 20 gains with 10x variation of one or more LogCAparameters For the ease of understanding we color these regionsand label them with their respective LogCA parameters For exam-ple the blue colored region labelled lsquooCrsquo (16B to 2KB) indicatesan optimization region where optimizing overheads and computa-tional index is beneficial Similarly the red colored region labelledlsquoArsquo (32KB to 32MB) represents an optimization region where opti-mizing peak acceleration is only beneficial The granularity rangeoccupied by a parameter also identifies the scope of optimizationfor an architect and a programmer For example for UltraSPARCT2 overheads occupy most of the lower granularity suggesting op-portunity for improving the interface Similarly the absence of thelatency parameter suggests little benefits for optimizing latency

      We also add horizontal arrows to the optimization regions inFigure 6 to demarcate the start and end of granularity range for each

      LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

      Table 2 Description of the Cryptographic accelerators

      Crypto Accelerator PCI Crypto UltraSPARC T2 SPARC T3 SPARC T4 Sandy BridgeProcessor AMD A8-3850 S2 S2 S3 Intel Core i7-2600Frequency 29 GHz 116 GHz 165 GHz 3 GHz 34 GHzOpenSSL version 098o 098o 098o 102 101k 098oKernel Ubuntu 3130-55 Oracle Solaris 11 Oracle Solaris 11 Oracle Solaris 112 Linux2632-504

      16 128 g1 1K 8K 64

      K51

      2K 4M 32M

      01

      1

      10

      100

      1000

      A

      CL

      oC AoCA

      oC

      A

      Granularity (Bytes)

      Sp

      eed

      up

      LogCA L110xo110x C10x A10x

      Figure 6 Optimization regions for UltraSPARC T2 The pres-ence of a parameter in an optimization region indicates thatit can at least provides 20 gains The horizontal arrow in-dicates the cut-off granularity at which a parameter provides20 gains

      parameter For example optimizing acceleration starts providingbenefits from 2KB while optimizing overheads or computationalindex is beneficial up till 32KB These arrows also indicate thecut-off granularity for each parameter These cut-off granularitiesprovide insights to architects and programmers about the designbottlenecks For example high cut-off granularity of 32KB suggestshigh overheads and thus a potential for optimization

      4 EXPERIMENTAL METHODOLOGYThis section describes the experimental setup and benchmarks forvalidating LogCA on real machines We also discuss our methodol-ogy for measuring LogCA parameters and performance metrics

      Our experimental setup comprises of on-chip and off-chip cryptoaccelerators (Table 2) and three different GPUs (Table 3) The on-chip crypto accelerators include cryptographic units on SunOracleUltraSPARC T2 [40] SPARC T3 [35] SPARC T4 [41] and AES-NI(AES New Instruction) [15] on Sandy Bridge whereas the off-chipaccelerator is a Hifn 7955 chip connected through the PCIe bus [43]The GPUs include a discrete NVIDIA GPU an integrated AMDGPU (APU) and HSA supported integrated GPU

      For the on-chip crypto accelerators each core in UltraSPARC T2and SPARC T3 has a physically addressed crypto unit which requiresprivileged DMA calls However the crypto unit on SPARC T4 isintegrated within the pipeline and does not require privileged DMAcalls SPARC T4 also provides non-privileged crypto instructions toaccess the crypto unit Similar to SPARC T4 sandy bridge providesnon-privileged crypto instructionmdashAESNI

      Considering the GPUs the discrete GPU is connected throughthe PCIe bus whereas for the APU the GPU is co-located with thehost processor on the same die For the APU the system memoryis partitioned between host and GPU memory This eliminates thePCIe bottleneck of data copying but it still requires copying databetween memories Unlike discrete GPU and APU HSA supportedGPU provides a unified and coherent view of the system memoryWith the host and GPU share the same virtual address space explicitcopying of data between memories is not required

      Our workloads consist of encryption hashing and GPU kernelsFor encryption and hashing we have used advanced encryptionstandard (AES) [30] and standard hashing algorithm (SHA) [31]respectively from OpenSSL [34]mdashan open source cryptography li-brary For GPU kernels we use matrix multiplication radix sortFFT and binary search from AMD OpenCL SDK [1] Table 4 we listthe complexities of each kernel both in terms of number of elementsn and granularity g We expect these complexities to remain same inboth cases but we observe that they differ for matrix multiplicationFor example for a square matrix of size n matrix multiplication hascomplexity of O (n3) whereas the complexity in terms of granularityis O (g17) This happens because for matrix multiplicationmdashunlikeothersmdashcomputations are performed on matrices and not vectorsSo offloading a square matrix of size n corresponds to offloading n2

      elements which results in the apparent discrepancy in the complexi-ties We also observe that for the granularity range of 16B to 32MBβ = 011 provides a close approximation for log(g)

      Table 3 Description of the GPUs

      Platform Discrete GPU Integrated APU AMD HSAName Tesla C2070 Radeon HD 6550 Radeon R7Architecture Fermi Beaver Creek KaveriCores 16 5 8Compute Units 448 400 512Clock Freq 15 GHz 600 MHz 720 MHzPeak FLOPS 1 T 480 G 856 GHostProcessor Intel AMD AMD

      Xeon E5520 A8-3850 A10-7850KFrequency GHz 227 29 17

      For calculating execution times we have used Linux utilities onthe crypto accelerators whereas for the GPUs we have used NVIDIAand AMD OpenCL profilers to compute the setup kernel and datatransfer times and we report the average of one hundred executionsFor verifying the usage of crypto accelerators we use built-in coun-ters in UltraSPARC T2 and T3 [46] SPARC T4 however no longer


Table 4: Algorithmic complexity of various kernels in terms of the number of elements (n) and the granularity (g). The power of g represents β for each kernel.

    Kernel                                Complexity in n   Complexity in g
    Advanced Encryption Standard (AES)    O(n)              O(g^1.01)
    Secure Hashing Algorithm (SHA)        O(n)              O(g^0.97)
    Matrix Multiplication (GEMM)          O(n^3)            O(g^1.7)
    Fast Fourier Transform (FFT)          O(n log n)        O(g^1.2)
    Radix Sort                            O(kn)             O(g^0.94)
    Binary Search                         O(log n)          O(g^0.14)

Table 5: Calculated values of the LogCA parameters.

    Device           Benchmark       L (cycles)   o (cycles)   C (cycles/B)   A
    Discrete GPU     AES             3x10^3       2x10^8       174            30
                     Radix Sort      3x10^3       2x10^8       290            30
                     GEMM            3x10^3       2x10^8       2              30
                     FFT             3x10^3       2x10^8       290            30
                     Binary Search   3x10^3       2x10^8       116            30
    APU              AES             15           4x10^8       174            7
                     Radix Sort      15           4x10^8       290            7
                     GEMM            15           4x10^8       2              7
                     FFT             15           4x10^8       290            7
                     Binary Search   15           4x10^8       116            7
    UltraSPARC T2    AES             1500         2.9x10^4     90             19
                     SHA             1500         1.05x10^3    72             12
    SPARC T3         AES             1500         2.7x10^4     90             12
                     SHA             1500         1.05x10^3    72             10
    SPARC T4         AES             500          435          32             12
                     SHA             500          1.6x10^3     32             10
    SPARC T4 instr.  AES             4            111          32             12
                     SHA             4            1638         32             10
    Sandy Bridge     AES             3            10           35             6

We use these execution times to determine the LogCA parameters. We calculate these parameters once per system and then reuse them for different kernels on the same system.

For the computational index and β, we profile the CPU code on the host by varying the granularity from 16B to 32MB. At each granularity, we measure the execution time and use regression analysis to determine C and β. For overheads, we use the observation that for very small granularities the execution time of a kernel on an accelerator is dominated by the overheads, i.e., \lim_{g \to 0} T_1(g) \simeq o. For acceleration, we use different methods for the on-chip accelerators and the GPUs. For the on-chip accelerators, we calculate acceleration using equation (3) and the observation that the speedup curve flattens out and approaches the acceleration for very large granularities. However, for the GPUs we do not use equation (3), as it would require computing acceleration for each kernel: each application has a different access pattern, which affects the speedup. Instead, we bound the maximum performance using the peak FLOPS from the device specifications and use the ratio of peak FLOPS on the GPU and CPU, i.e., A = PeakGFLOPS_GPU / PeakGFLOPS_CPU. Similar to acceleration, we use two different techniques for calculating latency. For the on-chip accelerators, we run micro-benchmarks and use the execution times on the host and the accelerators. For the GPUs, we compute latency using the peak memory bandwidth of the GPU: similar to Meswani et al. [29], we use L = 1 / BW_peak for the data copying time.
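As an illustration of this methodology, the following sketch runs the same fitting steps on synthetic timings; every constant in it is hypothetical, chosen only to make the script self-contained:

    import numpy as np

    g = np.logspace(4, 25, num=22, base=2.0)      # granularities: 16 B ... 32 MB
    t_host = 90.0 * g**1.01                       # pretend host measurements T0(g)

    # C and beta via regression: log T0(g) = log C + beta * log g.
    beta, logC = np.polyfit(np.log(g), np.log(t_host), 1)
    C = np.exp(logC)                              # recovers C = 90, beta = 1.01

    # o: accelerator time at the smallest granularity is overhead-dominated,
    # since lim_{g->0} T1(g) ~= o.
    t_acc = 2.9e4 + 19.0 * g                      # pretend accelerator measurements
    o = t_acc[0]

    # A for GPUs: ratio of peak FLOPS from the device specifications.
    A = 1.0e12 / 34.0e9                           # e.g., 1 TFLOPS GPU, 34 GFLOPS host

    # L for GPUs: reciprocal of peak memory bandwidth (time per byte copied).
    L = 1.0 / 8.0e9                               # e.g., 8 GB/s effective bandwidth
    print(C, beta, o, A, L)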

Earlier, we developed our model under the assumptions of granularity-independent and granularity-dependent latencies. In our setup, we observe that the on-chip crypto accelerators and the HSA-enabled GPU represent accelerators with granularity-independent latency, while the off-chip crypto accelerator and the discrete GPU/APU represent granularity-dependent accelerators. For each accelerator, we calculate the speedup and performance metrics using the respective equations (§2).
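For reference, the two speedup forms we evaluate can be sketched as follows; the function names are ours, the granularity-dependent form follows equation (8), and the granularity-independent form replaces the L*g term with a fixed L:

    def speedup_indep(g, o, L, C, A, beta=1.0):
        """Granularity-independent latency (on-chip accelerators, HSA GPU):
        latency is a fixed cost that adds to the overheads."""
        work = C * g**beta
        return work / (o + L + work / A)

    def speedup_dep(g, o, L, C, A, beta=1.0):
        """Granularity-dependent latency (off-chip crypto card, discrete
        GPU/APU), equation (8): latency grows with the offloaded bytes."""
        work = C * g**beta
        return work / (o + L * g + work / A)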

5 EVALUATION

In this section, we show that LogCA closely captures the behavior of both off-chip and on-chip accelerators. We list the calculated LogCA parameters in Table 5. To demonstrate the utility of our model, we also present two case studies, considering the evolution of the interface in Sun/Oracle's crypto accelerators and in three different GPU architectures. In both cases, we elaborate the design changes using the insights LogCA provides.

5.1 Linear-Complexity Kernels (β = 1)

Figure 7 shows the curve fitting of LogCA for AES. We consider both off-chip and on-chip accelerators, connected through interfaces ranging from the PCIe bus to special instructions. We observe that the off-chip accelerators and the APU, unlike the on-chip accelerators, provide reasonable speedup only at very large granularities. We also observe that the achievable speedup is limited by computational intensity for the off-chip accelerators and by acceleration for the on-chip accelerators. This observation supports the earlier implications on the limits of speedup for granularity-independent and granularity-dependent latencies in equations (3) and (9), respectively.

Figure 7 also shows that UltraSPARC T2 provides higher speedups than Sandy Bridge, but breaks even at a larger granularity. Sandy Bridge, on the other hand, breaks even at a very small granularity but provides limited speedup. The discrete GPU, with its powerful processing cores, has the highest acceleration of these designs. However, its observed speedup is lower than the others' due to the high overheads and latencies involved in communicating through the PCIe bus.

We have also marked g_1 and g_{A/2} for each accelerator in Figure 7, which help programmers and architects identify the complexity of the interface. For example, g_1 for the crypto instructions, i.e., SPARC T4 and Sandy Bridge, lies on the extreme left, while for the off-chip accelerators g_1 lies on the far right. It is worth mentioning that we have marked g_{A/2} for the on-chip accelerators but not for the off-chip accelerators: for the off-chip accelerators, computational intensity is less than acceleration and, as we have noted in equation (12), g_{A/2} for these designs does not exist.

We also observe that g_1 for the crypto card connected through the PCIe bus does not exist, showing that this accelerator does not break even, even for large granularities. Figure 7 also shows that g_1 for the GPU and APU is comparable. This observation shows that, despite being an integrated GPU and not connected to the PCIe bus, the APU spends considerable time copying data from the host to device memory.


[Figure 7: Speedup curve fitting plots comparing LogCA with the observed values of AES [30]. Panels: (a) PCIe crypto, (b) NVIDIA discrete GPU, (c) AMD integrated GPU (APU), (d) UltraSPARC T2, (e) SPARC T3, (f) SPARC T4 engine, (g) SPARC T4 instruction, (h) AES-NI on Sandy Bridge. Each panel plots speedup against granularity (bytes), marking A, C/L, g_1, and g_{A/2} where they exist; for the SPARC T4 instruction g_1 falls below 16B, and for AES-NI both g_1 and g_{A/2} fall below 16B.]

[Figure 8: Speedup curve fitting plots comparing LogCA with the observed values of SHA-256 [31]; LogCA starts following the observed values after 64B. Panels: (a) UltraSPARC T2 engine, (b) SPARC T3 engine, (c) SPARC T4 engine, (d) SPARC T4 instruction, each plotting speedup against granularity (bytes).]

Figure 8 shows the curve fitting for SHA on the various on-chip crypto accelerators. We observe that g_1 and g_{A/2} do exist, as all of these are on-chip accelerators. We also observe that the LogCA curve mostly follows the observed values. However, it deviates from the observed values below 64B. This happens because SHA requires a block size of 64B for hash computation: if the block size is less than 64B, it pads extra bits to reach 64B. Since LogCA does not capture this effect, it does not follow the observed speedup for granularities smaller than 64B.

Figure 9-a shows the speedup curve fitting plots for radix sort. We observe that LogCA does not follow the observed values at smaller granularities on the GPU. Despite this inaccuracy, LogCA accurately predicts g_1 and g_{A/2}. We also observe that g_{A/2} for the GPU is higher than for the APU, and this observation supports equation (7): increasing acceleration increases g_{A/2}.

5.2 Super-Linear Complexity Kernels (β > 1)

Figures 9-b and 9-c show the speedup curve fitting plots for super-linear kernels on the discrete GPU and APU. We observe that matrix multiplication, with higher complexity (O(g^1.7)), achieves higher speedup than sort and FFT, with lower complexities of O(g) and O(g^1.2), respectively. This observation corroborates the result from equation (9) that the achievable speedup of higher-complexity algorithms is higher than that of lower-complexity algorithms. We also observe that g_{A/2} does not exist for FFT. This happens because, as we note in equation (12), for g_{A/2} to exist for FFT, C/L should be greater than A/1.2. However, Figure 9-c shows that C/L is smaller than A/1.2 for both the GPU and APU.

5.3 Sub-Linear Complexity Kernels (β < 1)

Figure 9-d shows the curve fitting for binary search, a sub-linear algorithm (β = 0.14). We make three observations. First, g_1 does not exist even for very large granularities, as C/L < 1. This observation supports implication (5) that for a sub-linear algorithm


[Figure 9: Speedup curve fitting plots comparing LogCA with the observed values of (a) radix sort, (b) matrix multiplication, (c) FFT, and (d) binary search, each on the GPU and APU, plotted against granularity (bytes).]

of β = 0.14, C/L should be greater than 7 (since 1/β ≈ 7.1) to provide any speedup. Second, for large granularities, speedup starts decreasing with an increase in granularity. This observation supports our earlier claim in implication (4) that for systems with granularity-dependent latencies, the speedup for sub-linear algorithms asymptotically decreases. Third, LogCA deviates from the observed values at large granularities. This deviation occurs because LogCA does not model caches. As mentioned earlier, LogCA abstracts the caches and memories with a single latency parameter, which does not capture the memory-access pattern accurately. Even though LogCA does not accurately capture binary search's behavior, it still provides an upper bound on the achievable performance.

5.4 Case Studies

Figure 10 shows the evolution of crypto accelerators in SPARC architectures, from the off-chip accelerators in pre-Niagara systems (Figure 10 (a)) to accelerators integrated within the pipeline in SPARC T4 (Figure 10 (e)). We observe that latency is absent from the on-chip accelerators' optimization regions, as these accelerators are closely coupled with the host. We also note that the optimization region with overheads, representing the complexity of an accelerator's interface, shrinks, while the optimization regions with acceleration expand from Figure 10 (a) to (e). For example, for the off-chip crypto accelerator the cut-off granularity for overheads is 256KB, whereas it is 128B for SPARC T4, suggesting a much simpler interface.

Figure 10 (a) shows the optimization regions for the off-chip crypto accelerator connected through the PCIe bus. We note that overheads and latencies occupy most of the optimization regions, indicating high-overhead OS calls and high-latency data copying over the PCIe bus as the bottlenecks.

Figure 10 (b) shows the optimization regions for UltraSPARC T2. The large cut-off granularity for overheads at 32KB suggests a complex interface, indicating that high-overhead OS calls create a bottleneck at small granularities. The cut-off granularity of 2KB for acceleration suggests that optimizing acceleration is beneficial at large granularities.

Figure 10 (d) shows the optimization regions for the on-chip accelerator on SPARC T4. There are three optimization regions, with the cut-off granularity for overheads now reduced to only 512B. This observation suggests a considerable improvement in the interface design over SPARC T3, which is also evident from a smaller g_1. We also note that the cut-off granularity for acceleration now decreases to 32B, showing an increase in the opportunity for optimizing acceleration.

Figure 10 (e) shows the optimization regions for the crypto instructions on SPARC T4. We observe that, unlike the earlier designs, it has only two optimization regions, and the speedup approaches the peak acceleration at a small granularity of 128B. In contrast, UltraSPARC T2 and SPARC T3 do not provide any gains at this granularity. We also observe that the cut-off granularity for overheads further reduces to 128B, suggesting some opportunity for optimization at very small granularities. The model also shows that acceleration occupies the widest optimization range: optimizing acceleration provides benefits for granularities greater than 16B. The low-overhead access which LogCA exposes is due to the non-privileged instruction SPARC T4 uses to access the cryptographic unit, which is integrated within the pipeline.

Figure 11 shows the evolution of memory interface design in GPU architectures. It shows the optimization regions for matrix multiplication on a discrete NVIDIA GPU, an integrated AMD GPU (APU), and an integrated AMD GPU with HSA support. We observe that matrix multiplication is compute bound on all three architectures (§3.1).


[Figure 10: LogCA optimization regions for performing the Advanced Encryption Standard on various crypto accelerators: (a) PCIe crypto accelerator, (b) UltraSPARC T2, (c) SPARC T3, (d) SPARC T4 engine, (e) SPARC T4 instruction. Each panel plots speedup against granularity (bytes) with curves for 10x improvements in L, o, C, and A. LogCA identifies the design bottlenecks through the LogCA parameters in an optimization region; the bottleneck LogCA suggests in each design is optimized in the next design.]

[Figure 11: Optimization regions for matrix multiplication over a range of granularities on (a) an NVIDIA discrete GPU, (b) an AMD APU, and (c) an HSA-supported AMD integrated GPU, with curves for 10x improvements in L, o, C, and A.]

We also observe that the computational index occupies most of the regions, which signifies maximum optimization potential.

The discrete GPU has four optimization regions (Figure 11 (a)). Among these, latency dominates most of the regions, signifying high-latency data copying over the PCIe bus and thus maximum optimization potential. The high cut-off granularity for overheads at 32KB indicates high-overhead OS calls to access the GPU. Similarly, with highly aggressive cores, acceleration has a high cut-off granularity of 256KB, indicating less optimization potential for acceleration.

Similar to the discrete GPU, the APU also has four optimization regions (Figure 11 (b)), with a few notable differences compared to the discrete GPU: the cut-off granularity for latency reduces to 512KB with the elimination of data copying over the PCIe bus; the overheads are still high, suggesting high-overhead OS calls to access the APU; and, with less aggressive cores, the cut-off granularity for acceleration reduces to 64KB, implying more optimization potential for acceleration.

Figure 11 (c) shows three optimization regions for the HSA-enabled integrated GPU. We observe that latency is absent from all regions and the cut-off granularity for overheads reduces to 8KB. These reductions in overheads and latencies signify a simpler interface compared to the discrete GPU and APU. We also observe that the cut-off granularity for acceleration drops to 2KB, suggesting higher potential for optimizing acceleration.

6 RELATED WORK

We compare and contrast our work with prior approaches. Lopez-Novoa et al. [28] provide a detailed survey of various accelerator modeling techniques. We broadly classify these techniques into two categories and discuss the most relevant work.

Analytical Models: There is a rich body of work exploring analytical models for performance prediction of accelerators. For some models, the motivation is to determine the future trend in heterogeneous architectures. Chung et al. [7], in a detailed study, predict the future landscape of heterogeneous computing. Hempstead et al. [17] propose an early-stage model, Navigo, that determines the fraction of area required for accelerators to maintain the traditional performance trend. Nilakantan et al. [32] propose to incorporate communication cost in early-stage models of accelerator-rich architectures. For others, the motivation is to determine the right amount of data to offload and the potential benefits associated with an accelerator [24].

Some analytical models are architecture-specific. For example, a number of studies [20, 21, 44, 57] predict the performance of GPU architectures. Hong et al. [20] present an analytical performance model for predicting execution time on GPUs. They later extend their model and develop an integrated power and performance model for GPUs [21]. Song et al. [44] use a simple counter-based approach to predict power and performance. Meswani et al. [29] explore such models for high performance applications. Daga et al. [11] analyze the effectiveness of accelerated processing units (APUs) relative to GPUs and describe the communication cost over the PCIe bus as a major bottleneck in exploiting the full potential of GPUs.

In general, our work differs from these studies in complexity. These models use a large number of parameters to accurately predict power and/or performance, whereas we limit the number of parameters to reduce the complexity of our model. They also require a deep understanding of the underlying architecture, and most of them require access to GPU-specific assembly or PTX code. Unlike these approaches, we use CPU code to provide bounds on the performance.

Roofline Models: In terms of simplicity and motivation, our work closely matches the Roofline model [54], a visual performance model for multi-core architectures. Roofline exposes the bottlenecks for a kernel and suggests several optimizations which programmers can use to fine-tune the kernel on a given system.

A number of extensions of Roofline have been proposed [10, 22, 33, 56], and some of these extensions are architecture-specific, for example targeting GPUs [22], vector processors [39], and FPGAs [10, 56].

Despite the similarities, Roofline and its extensions cannot be used for exposing design bottlenecks in an accelerator's interface. The primary goal of Roofline models has been to help programmers and compiler writers, while LogCA provides more insights for architects.

7 CONCLUSION AND FUTURE WORK

With the recent trend towards heterogeneous computing, we feel that the architecture community lacks a model to reason about the need for accelerators. In this respect, we propose LogCA, an insightful visual performance model for hardware accelerators. LogCA provides insights early in the design stage to both architects and programmers: it identifies performance bounds, exposes interface design bottlenecks, and suggests optimizations to alleviate these bottlenecks. We have validated our model across a range of on-chip and off-chip accelerators and have shown its utility using retrospective studies describing the evolution of the accelerator interface in these architectures.

The applicability of LogCA can be limited by our simplifying assumptions, and for more realistic analysis we plan to overcome these limitations in future work. For example, we assume a single-accelerator system and do not explicitly model contention among resources; our model should handle multi-accelerator and pipelined scenarios. For fixed-function accelerators, our design space is currently limited to encryption and hashing kernels. To overcome this, we are expanding our design space with the compression and database accelerators in the Oracle M7 processor. We also plan to complement LogCA with an energy model, as energy efficiency is a prime design metric for accelerators.

ACKNOWLEDGEMENTS

We thank our anonymous reviewers, Arkaprava Basu, and Tony Nowatzki for their insightful comments and feedback on the paper. Thanks to Mark Hill, Michael Swift, the Wisconsin Computer Architecture Affiliates, and other members of the Multifacet group for their valuable discussions. We also thank Brian Wilson at University of Wisconsin DoIT and Eric Sedlar at Oracle Labs for providing access to SPARC T3 and T4 servers, respectively. Also thanks to Muhammad Umair Bin Altaf for his help in the formulation. This work is supported in part by the National Science Foundation (CNS-1302260, CCF-1438992, CCF-1533885, CCF-1617824), Google, and the University of Wisconsin-Madison (Amar and Balindar Sohi Professorship in Computer Science). Wood has a significant financial interest in AMD and Google.

REFERENCES

[1] Advanced Micro Devices. 2016. APP SDK - A Complete Development Platform. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk
[2] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS '67 (Spring)), 483. https://doi.org/10.1145/1465482.1465560
[3] Dan Anderson. 2012. How to tell if SPARC T4 crypto is being used. https://blogs.oracle.com/DanX/entry/how_to_tell_if_sparc
[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. 2009. A View of the Parallel Computing Landscape. Commun. ACM 52, 10 (Oct. 2009), 56-67. https://doi.org/10.1145/1562764.1562783
[5] Nathan Beckmann and Daniel Sanchez. 2016. Cache Calculus: Modeling Caches through Differential Equations. IEEE Computer Architecture Letters PP, 99 (2016), 1. https://doi.org/10.1109/LCA.2015.2512873
[6] C. Cascaval, S. Chatterjee, H. Franke, K. J. Gildea, and P. Pattnaik. 2010. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development 54 (2010), 5:1-5:10. https://doi.org/10.1147/JRD.2010.2059721
[7] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. 2010. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (Dec. 2010), 225-236. https://doi.org/10.1109/MICRO.2010.36
[8] Jason Cong, Zhenman Fang, Michael Gill, and Glenn Reinman. 2015. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. In 2015 IEEE/ACM International Conference on Computer-Aided Design, Austin, TX.
[9] D. Culler, R. Karp, D. Patterson, and A. Sahay. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1-12. http://dl.acm.org/citation.cfm?id=155333
[10] Bruno da Silva, An Braeken, Erik H. D'Hollander, and Abdellah Touhafi. 2013. Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools. Int. J. Reconfig. Comput. 2013 (Jan. 2013). https://doi.org/10.1155/2013/428078
[11] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. 2011. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing. IEEE, 141-149. https://doi.org/10.1109/SAAHPC.2011.29
[12] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA '11). ACM Press, New York, NY, USA, 365. https://doi.org/10.1145/2000064.2000108
[13] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. 2010. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development 54 (2010), 3:1-3:11. https://doi.org/10.1147/JRD.2009.2036980
[14] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings of the International Symposium on High-Performance Computer Architecture, 503-514. https://doi.org/10.1109/HPCA.2011.5749755
[15] Shay Gueron. 2012. Intel Advanced Encryption Standard (AES) Instructions Set. Technical Report. Intel Corporation. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[16] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10), 37. https://doi.org/10.1145/1815961.1815968
[17] Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. Navigo: An early-stage model to study power-constrained architectures and specialization. In ISCA Workshop on Modeling, Benchmarking and Simulations (MoBS).
[18] John L. Hennessy and David A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach. 704 pages.
[19] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (July 2008), 33-38. https://doi.org/10.1109/MC.2008.209
[20] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, Vol. 37, 152-163. https://doi.org/10.1145/1555815.1555775
[21] Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, Vol. 38, 280-289. https://doi.org/10.1145/1816038.1815998
[22] Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par '12). Springer-Verlag, Berlin, Heidelberg, 920-932. https://doi.org/10.1007/978-3-642-32820-6_90
[23] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 468-479. https://doi.org/10.1145/2540708.2540748
[24] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. 2013. A survey of computation offloading for mobile systems. Mobile Networks and Applications 18 (2013), 129-140. https://doi.org/10.1007/s11036-012-0368-0
[25] Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 361-372. https://doi.org/10.1145/2751205.2751231
[26] Maysam Lavasani, Hari Angepat, and Derek Chiou. 2014. An FPGA-based In-Line Accelerator for Memcached. IEEE Computer Architecture Letters 13, 2 (July 2014), 57-60. https://doi.org/10.1109/L-CA.2013.17
[27] John D. C. Little and Stephen C. Graves. 2008. Little's law. In Building Intuition. Springer, 81-100.
[28] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso. 2015. A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. IEEE Transactions on Parallel and Distributed Systems 26, 1 (Jan. 2015), 272-281. https://doi.org/10.1109/TPDS.2014.2308216
[29] M. R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole. 2013. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications 27 (2013), 89-108. https://doi.org/10.1177/1094342012468180
[30] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). https://doi.org/10.6028/NIST.FIPS.197
[31] National Institute of Standards and Technology. 2008. Secure Hash Standard. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
[32] S. Nilakantan, S. Battle, and M. Hempstead. 2013. Metrics for Early-Stage Modeling of Many-Accelerator Architectures. IEEE Computer Architecture Letters 12, 1 (Jan. 2013), 25-28. https://doi.org/10.1109/L-CA.2012.9
[33] Cedric Nugteren and Henk Corporaal. 2012. The Boat Hull Model: Enabling Performance Prediction for Parallel Computing Prior to Code Development. In Proceedings of the 9th Conference on Computing Frontiers. ACM, 203-212.
[34] OpenSSL Software Foundation. 2015. OpenSSL: Cryptography and SSL/TLS Toolkit. https://openssl.org
[35] Sanjay Patel. 2009. Sun's Next-Generation Multithreaded Processor: Rainbow Falls. In 21st Hot Chips Symposium. http://www.hotchips.org/wp-content/uploads/hc
[36] Sanjay Patel and Wen-mei W. Hwu. 2008. Accelerator Architectures. IEEE Micro 28, 4 (July 2008), 4-12. https://doi.org/10.1109/MM.2008.50
[37] Stephen Phillips. 2014. M7: Next Generation SPARC. In 26th Hot Chips Symposium.
[38] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.
[39] Yoshiei Sato, Ryuichi Nagaoka, Akihiro Musa, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Hiroaki Kobayashi. 2009. Performance tuning and analysis of future vector processors based on the roofline model. In Proceedings of the 10th MEDEA Workshop on MEmory performance: DEaling with Applications, systems and architecture (MEDEA '09), 7. https://doi.org/10.1145/1621960.1621962
[40] M. Shah, J. Barren, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. 2007. UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC. In Solid-State Circuits Conference, 2007 (ASSCC '07), IEEE Asian, 22-25. https://doi.org/10.1109/ASSCC.2007.4425786
[41] Manish Shah, Robert Golla, Gregory Grohoski, Paul Jordan, Jama Barreh, Jeffrey Brooks, Mark Greenberg, Gideon Levinsky, Mark Luttrell, Christopher Olson, Zeid Samoail, Matt Smittle, and Thomas Ziaja. 2012. Sparc T4: A dynamically threaded server-on-a-chip. IEEE Micro 32 (2012), 8-19. https://doi.org/10.1109/MM.2012.1
[42] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In International Symposium on Computer Architecture (ISCA).
[43] Soekris Engineering. 2016. vpn1401 for Std. PCI-sockets. http://soekris.com/products/vpn-1401.html
[44] Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In Proceedings of the IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS 2013), 673-686. https://doi.org/10.1109/IPDPS.2013.73
[45] Jeff Stuecheli. 2013. POWER8. In 25th Hot Chips Symposium.
[46] Ning Sun and Chi-Chang Lin. 2007. Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors. Technical Report. http://www.oracle.com/technetwork/server-storage/solaris/documentation/819-5782-150147.pdf
[47] S. Tabik, G. Ortega, and E. M. Garzón. 2014. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. The Journal of Supercomputing 70, 2 (Nov. 2014), 577-587. https://doi.org/10.1007/s11227-014-1102-4
[48] Y. C. Tay. 2013. Analytical Performance Modeling for Computer Systems (2nd ed.). Morgan & Claypool Publishers.
[49] M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, 1131-1136. https://doi.org/10.1145/2228360.2228567
[50] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 205-218. https://doi.org/10.1145/1736020.1736044
[51] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44), 163. https://doi.org/10.1145/2155620.2155640
[52] Guibin Wang, Yisong Lin, and Wei Yi. 2010. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on Cyber, Physical and Social Computing (CPSCom), 344-350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102
[53] Eric W. Weisstein. 2015. Newton's Method. From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html
[54] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52 (2009), 65-76. https://doi.org/10.1145/1498765.1498785
[55] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 255-268. https://doi.org/10.1145/2541940.2541961
[56] Moein Pahlavan Yali. 2014. FPGA-Roofline: An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems. PhD Dissertation. Virginia Polytechnic Institute and State University.
[57] Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings of the International Symposium on High-Performance Computer Architecture, 382-393. https://doi.org/10.1109/HPCA.2011.5749745



[Figure 3: A graphical description of the performance metrics, marking g_1, g_{A/2}, A/2, and A on a plot of speedup against granularity (bytes).]

At one extreme, for large granularities, equation (2) becomes

    \lim_{g \to \infty} Speedup(g) = A    (3)

While for small granularities, equation (2) reduces to

    \lim_{g \to 0} Speedup(g) \simeq \frac{C}{o + L + C/A} < \frac{C}{o + L}    (4)

Equation (4) is simply Amdahl's Law [2] for accelerators, demonstrating the dominating effect of overheads at small granularities.
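To see how strongly overheads dominate, consider a worked instance of equation (4) using the UltraSPARC T2 AES parameters reported in Table 5 (o = 2.9x10^4 cycles, L = 1.5x10^3 cycles, C = 90 cycles/B, A = 19); treating the offload as a single byte is our simplification:

    % Worked example of equation (4), assuming Table 5's UltraSPARC T2 AES row:
    \lim_{g \to 0} \text{Speedup}(g) \simeq \frac{C}{o + L + C/A}
      = \frac{90}{2.9\times 10^{4} + 1.5\times 10^{3} + 90/19}
      \approx 3\times 10^{-3}

That is, a minimal offload runs more than two orders of magnitude slower than the host, which is why speedup curves start well below one at the smallest granularities.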

2.2 Performance Metrics

To help programmers decide when and how much computation to offload, we define two performance metrics. These metrics are inspired by the vector machine metrics N_v and N_{1/2} [18], where N_v is the vector length needed to make vector mode faster than scalar mode and N_{1/2} is the vector length needed to achieve half of the peak performance. Since vector length is an important parameter in determining performance gains for vector machines, these metrics characterize the behavior and efficiency of vector machines with reference to scalar machines. Our metrics serve the same purpose in the accelerator domain.

g_1: The granularity to achieve a speedup of 1 (Figure 3). It is the break-even point where the accelerator's performance becomes equal to the host's. Thus, it is the minimum granularity at which an accelerator starts providing benefits. Solving equation (2) for g_1 gives

    g_1 = \left[ \frac{A}{A-1} \cdot \frac{o+L}{C} \right]^{1/\beta}    (5)

IMPLICATION 1. g_1 is essentially independent of acceleration for large values of A.

For reducing g_1, the above implication guides an architect to invest resources in improving the interface.

IMPLICATION 2. Doubling the computational index reduces g_1 by a factor of 2^{1/\beta}.

The above implication demonstrates the effect of algorithmic complexity on g_1 and shows that varying the computational index has a profound effect on g_1 for sub-linear algorithms. For example, for a sub-linear algorithm with β = 0.5, doubling the computational index decreases g_1 by a factor of four, whereas for linear (β = 1) and quadratic (β = 2) algorithms, g_1 decreases by factors of two and √2, respectively.

g_{A/2}: The granularity to achieve a speedup of half of the acceleration. This metric provides information about a system's behavior after the break-even point and shows how quickly the speedup can ramp towards acceleration. Solving equation (2) for g_{A/2} gives

    g_{A/2} = \left[ A \cdot \frac{o+L}{C} \right]^{1/\beta}    (6)

Using equations (5) and (6), g_1 and g_{A/2} are related as

    g_{A/2} = (A-1)^{1/\beta} \cdot g_1    (7)

IMPLICATION 3. Doubling acceleration A increases the granularity needed to attain A/2 by a factor of 2^{1/\beta}.

The above implication demonstrates the effect of acceleration on g_{A/2} and shows that this effect is more pronounced for sub-linear algorithms. For example, for a sub-linear algorithm with β = 0.5, doubling acceleration increases g_{A/2} by a factor of four, whereas for linear and quadratic algorithms g_{A/2} increases by factors of two and √2, respectively.

For architects, equation (7) also exposes an interesting design trade-off between acceleration and the performance metrics. Typically, an architect may prefer higher acceleration and lower g_1 and g_{A/2}. However, equation (7) shows that increasing acceleration also increases g_{A/2}. This presents a dilemma for an architect: favor either higher acceleration or reduced granularity, especially for sub-linear algorithms. LogCA helps by exposing these trade-offs at an early design stage.
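For concreteness, here is a small sketch of both metrics in Python (the parameter values are hypothetical, loosely patterned on the UltraSPARC T2 AES row of Table 5); the last line checks the relation in equation (7):

    def g1(o, L, C, A, beta=1.0):
        """Break-even granularity, equation (5)."""
        return ((A / (A - 1)) * ((o + L) / C)) ** (1 / beta)

    def g_half(o, L, C, A, beta=1.0):
        """Granularity to reach a speedup of A/2, equation (6)."""
        return (A * (o + L) / C) ** (1 / beta)

    o, L, C, A, beta = 2.9e4, 1.5e3, 90.0, 19.0, 1.0   # hypothetical values
    print(g1(o, L, C, A, beta))                        # ~358 bytes
    print(g_half(o, L, C, A, beta))                    # ~6.4 KB
    # Ratio equals (A-1)^(1/beta) = 18, as equation (7) predicts.
    print(g_half(o, L, C, A, beta) / g1(o, L, C, A, beta))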

In our model, we also use g_1 to determine the complexity of the system's interface. A lower g_1 (towards the left side of the plot in Figure 3) is desirable, as it implies a system with lower overheads and thus a simpler interface. Likewise, g_1 increases with the complexity of the interface, or when an accelerator moves further away from the host.

2.3 Granularity Dependent Latency

The previous section assumed granularity-independent latency, but we have observed granularity-dependent latencies in GPUs. In this section, we discuss the effect of granularity on speedup and derive the performance metrics assuming granularity-dependent latency.

Assuming granularity-dependent latency, equation (1) reduces to

    Speedup(g) = \frac{C \cdot g^{\beta}}{o + L \cdot g + \frac{C \cdot g^{\beta}}{A}}    (8)

For large granularities, equation (8) reduces to

    \lim_{g \to \infty} Speedup(g) = \frac{A}{\frac{A}{C \cdot g^{\beta}} \cdot (L \cdot g) + 1} < \frac{C}{L} \cdot g^{\beta-1}    (9)

Unlike equation (3), speedup in the above equation approaches (C/L) \cdot g^{\beta-1} at large granularities. Thus, for linear algorithms with granularity-dependent latency, speedup is limited by C/L instead of acceleration. For super-linear algorithms this limit increases by a factor of g^{\beta-1}, whereas for sub-linear algorithms it decreases by a factor of g^{\beta-1}.


IMPLICATION 4. With granularity-dependent latency, the speedup for sub-linear algorithms asymptotically decreases with an increase in granularity.

The above implication suggests that for sub-linear algorithms on systems with granularity-dependent latency, speedup may decrease for large granularities. This happens because, at large granularities, the communication latency (a linear function of granularity) may exceed the computation time (a sub-linear function of granularity) on the accelerator, resulting in a net de-acceleration. This implication is surprising, as we observed earlier that, for systems with granularity-independent latency, speedup for all algorithms increases with granularity and approaches acceleration for very large granularities.

For very small granularities, equation (8) reduces to

    \lim_{g \to 0} Speedup(g) \simeq \frac{A \cdot C}{A \cdot (o + L) + C}    (10)

Similar to equation (4), the above equation exposes the increasing effect of overheads at small granularities. Solving equation (8) for g_1 using Newton's method [53] gives

    g_1 = \frac{C \cdot (\beta - 1) \cdot (A - 1) + A \cdot o}{C \cdot \beta \cdot (A - 1) - A \cdot L}    (11)

For a positive value of g_1, equation (11) must satisfy C/L > 1/\beta. Thus, for achieving any speedup with linear algorithms, C/L should be at least 1. However, for super-linear algorithms a speedup of 1 can be achieved at values of C/L smaller than 1, whereas for sub-linear algorithms C/L must be greater than 1.

IMPLICATION 5. With granularity-dependent latency, the computational intensity for sub-linear algorithms should be greater than 1 to achieve any gains.

Thus, for sub-linear algorithms, the computational index has to be greater than the latency to justify offloading the work. However, for higher-complexity algorithms, the computational index can be quite small and the kernel can still be potentially useful to offload.

Similarly, solving equation (8) using Newton's method for g_{A/2} gives

    g_{A/2} = \frac{C \cdot (\beta - 1) + A \cdot o}{C \cdot \beta - A \cdot L}    (12)

For a positive value of g_{A/2}, equation (12) must satisfy C/(A \cdot L) > 1/\beta. Thus, for achieving a speedup of A/2, C/L should be at least A for linear algorithms. However, for super-linear algorithms a speedup of A/2 can be achieved at values of C/L smaller than A, whereas for sub-linear algorithms C/L must be greater than A.

IMPLICATION 6. With granularity-dependent latency, the accelerator's computational intensity for sub-linear algorithms should be greater than 1 to achieve a speedup of half of the acceleration.

The above implication suggests that, for achieving half of the acceleration with sub-linear algorithms, the computation time on the accelerator must be greater than the latency. However, for super-linear algorithms, that speedup can be achieved even if the computation time on the accelerator is lower than the latency. Programmers can use the above implications to determine, early in the design cycle, whether to put time and effort into porting a code to an accelerator.
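Under the granularity-dependent regime, the closed forms in equations (11) and (12) translate directly into code. The sketch below (the function names and example values are ours, loosely patterned on the discrete-GPU rows of Table 5) also encodes the existence conditions derived above:

    def g1_dep(o, L, C, A, beta=1.0):
        """Break-even granularity under granularity-dependent latency, eq. (11).
        Only meaningful when C*beta*(A-1) > A*L, i.e. roughly C/L > 1/beta."""
        num = C * (beta - 1) * (A - 1) + A * o
        den = C * beta * (A - 1) - A * L
        return num / den if den > 0 else float("inf")   # inf: never breaks even

    def g_half_dep(o, L, C, A, beta=1.0):
        """Granularity for a speedup of A/2, eq. (12).
        Requires C/(A*L) > 1/beta; otherwise the metric does not exist."""
        den = C * beta - A * L
        return (C * (beta - 1) + A * o) / den if den > 0 else float("inf")

    # Hypothetical discrete-GPU-like values: large overhead, per-byte latency.
    o, L, C, A = 2e8, 2.0, 290.0, 30.0
    print(g1_dep(o, L, C, A, beta=0.94))       # a radix-sort-like kernel
    print(g_half_dep(o, L, C, A, beta=0.94))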

[Figure 4: LogCA helps in visually identifying (a) compute-bound and (b) latency-bound kernels. For compute-bound kernels, \lim_{g \to \infty} Speedup(g) = A whether C/L is greater or less than A; for latency-bound kernels, with C/L < A, the speedup saturates at C/L or below.]

For example, consider a system with a minimum desirable speedup of one half of the acceleration, but with a computational intensity of less than the acceleration. With the above implication, architects and programmers can infer, early in the design stage, that the desired speedup cannot be achieved for sub-linear and linear algorithms; however, it can be achieved with super-linear algorithms.

We are also interested in quantifying the limits on achievable speedup due to overheads and latencies. To do this, we assume a hypothetical accelerator with infinite acceleration and calculate the granularity (g_A) needed to achieve the peak speedup of A. With this assumption, the desired speedup of A is limited only by the overheads and latencies. Solving equation (8) for g_A gives

    g_A = \frac{C \cdot (\beta - 1) + A \cdot o}{C \cdot \beta - A \cdot L}    (13)

Surprisingly, we find that the above equation is identical to equation (12), i.e., g_A equals g_{A/2}. This observation shows that, with a hypothetical accelerator, the peak speedup can be achieved at the same granularity as g_{A/2}. It also demonstrates that if g_{A/2} is not achievable on a system, i.e., C/(A \cdot L) < 1/\beta as per equation (12), then despite increasing the acceleration, g_A will not be achievable and the speedup will still be bounded by the computational intensity.

IMPLICATION 7. If a speedup of A/2 is not achievable on an accelerator with acceleration A, then despite increasing the acceleration to Ã (where Ã > A), the speedup remains bounded by the computational intensity.

The above implication helps architects allocate more resources to an efficient interface instead of increasing acceleration.


[Figure 5: The effect on speedup of a 10x improvement in each LogCA parameter: (a) latency (no variation), (b) overheads, (c) computational index, and (d) acceleration, each plotted against granularity (bytes). The base case is the speedup of AES [30] on UltraSPARC T2.]

3 APPLICATIONS OF LogCA

In this section, we describe the utility of LogCA for visually identifying performance bounds, design bottlenecks, and possible optimizations to alleviate these bottlenecks.

3.1 Performance Bounds

Earlier, we observed that the speedup is bounded by either acceleration (equation (3)) or the product of computational intensity and g^{\beta-1} (equation (9)). Using these observations, we classify kernels as either compute-bound or latency-bound. For compute-bound kernels, the achievable speedup is bounded by acceleration, whereas for latency-bound kernels the speedup is bounded by computational intensity. Based on this classification, a compute-bound kernel either runs on a system with granularity-independent latency or has super-linear complexity while running on a system with granularity-dependent latency; Figure 4-a illustrates these bounds. A latency-bound kernel, on the other hand, runs on a system with granularity-dependent latency and has either linear or sub-linear complexity; Figure 4-b illustrates these bounds.

Programmers and architects can visually identify these bounds and use this information to invest their time and resources in the right direction. For example, for compute-bound kernels, depending on the operating granularity, it may be beneficial to invest more resources in either increasing acceleration or reducing overheads. For latency-bound kernels, however, optimizing acceleration and overheads is not as critical; decreasing latency and increasing the computational index may be more beneficial.
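A minimal sketch of this classification (the function and its arguments are our naming, not part of the model's definition):

    def speedup_bound(g, C, L, A, beta, latency_grows_with_g):
        """Asymptotic speedup bound per the classification above:
        compute-bound kernels approach A; latency-bound kernels are
        capped by computational intensity, (C/L) * g**(beta - 1)."""
        if not latency_grows_with_g:
            return A                              # granularity-independent latency
        # Granularity-dependent latency: the C/L * g^(beta-1) cap applies,
        # but the speedup can never exceed the peak acceleration A.
        return min(A, (C / L) * g ** (beta - 1))

For a binary-search-like kernel (β = 0.14) with C/L < 1, the returned bound falls below 1 at large g, matching implication (4).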

3.2 Sensitivity Analysis

To identify the design bottlenecks, we perform a sensitivity analysis of the LogCA parameters. We consider a parameter a design bottleneck if a 10x improvement in it provides at least a 20% improvement in speedup. A 'bottlenecked' parameter also provides an optimization opportunity. To visually identify these bottlenecks, we introduce optimization regions. As an example, we identify the design bottlenecks in UltraSPARC T2's crypto accelerator by varying its individual parameters in Figure 5 (a)-(d). (We elaborate our methodology for measuring the LogCA parameters later, in §4.)

Figure 5 (a) shows the variation (or lack thereof) in speedup with the decrease in latency. The resulting gains are negligible and independent of the granularity, as this is a closely coupled accelerator.

Figure 5 (b) shows the resulting speedup after reducing overheads. Since the overheads are a one-time initialization cost, independent of granularity, the per-byte setup cost is high at small granularities. Decreasing these overheads considerably reduces the per-byte setup cost and results in significant gains at these smaller granularities. Conversely, for larger granularities the per-byte setup cost is already amortized, so reducing overheads does not provide much gain. Thus, overhead is a bottleneck at small granularities and provides an opportunity for optimization.

Figure 5 (c) shows the effect of increasing the computational index. The results are similar to optimizing overheads in Figure 5 (b), i.e., significant gains for small granularities and a gradual decrease in the gains with increasing granularity. With constant overheads, increasing the computational index increases the computation time of the kernel and decreases the per-byte setup cost. For smaller granularities, the reduced per-byte setup cost results in significant gains.

Figure 5 (d) shows the variation in speedup with increasing peak acceleration. The gains are negligible at small granularities and become significant at large granularities. As mentioned earlier, the per-byte setup cost is high at small granularities and reduces at large granularities. Since increasing peak acceleration does not reduce the per-byte setup cost, optimizing peak acceleration provides gains only at large granularities.

We group these individual sensitivity plots in Figure 6 to build the optimization regions. As mentioned earlier, each region indicates the potential for 20% gains with a 10x variation of one or more LogCA parameters. For ease of understanding, we color these regions and label them with their respective LogCA parameters. For example, the blue region labelled 'oC' (16B to 2KB) indicates an optimization region where optimizing overheads and computational index is beneficial. Similarly, the red region labelled 'A' (32KB to 32MB) represents an optimization region where optimizing peak acceleration alone is beneficial. The granularity range occupied by a parameter also identifies the scope of optimization for an architect and a programmer. For example, for UltraSPARC T2, overheads occupy most of the lower granularities, suggesting an opportunity for improving the interface. Similarly, the absence of the latency parameter suggests little benefit from optimizing latency.
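The sensitivity sweep itself is easy to mechanize. The sketch below (assuming NumPy; the parameter values are hypothetical, loosely patterned on Table 5's UltraSPARC T2 AES row) marks, for each parameter, the granularity range where a 10x improvement buys at least 20% speedup; it reproduces the qualitative picture of Figure 6, including the absence of a latency region:

    import numpy as np

    def speedup_indep(g, o, L, C, A, beta=1.0):
        # Granularity-independent latency, as for this on-chip accelerator.
        work = C * g**beta
        return work / (o + L + work / A)

    base = dict(o=2.9e4, L=1.5e3, C=90.0, A=19.0, beta=1.0)  # hypothetical values
    g = np.logspace(4, 25, num=64, base=2.0)                 # 16 B ... 32 MB

    # A parameter is a bottleneck where improving it 10x buys >= 20% speedup.
    for p, factor in (("o", 0.1), ("L", 0.1), ("C", 10.0), ("A", 10.0)):
        varied = dict(base, **{p: base[p] * factor})
        gain = speedup_indep(g, **varied) / speedup_indep(g, **base) - 1.0
        region = g[gain >= 0.2]
        print(p, f"{region.min():.0f}B-{region.max():.0f}B" if region.size else "no region")

Run on these values, the sweep reports no region for L, regions up to roughly 32KB for o and C, and a region starting near 2KB for A, matching the narrative above.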

We also add horizontal arrows to the optimization regions in Figure 6 to demarcate the start and end of the granularity range for each parameter.

Table 2: Description of the cryptographic accelerators.

                      PCI Crypto         UltraSPARC T2       SPARC T3            SPARC T4             Sandy Bridge
    Processor         AMD A8-3850        S2                  S2                  S3                   Intel Core i7-2600
    Frequency         2.9 GHz            1.16 GHz            1.65 GHz            3 GHz                3.4 GHz
    OpenSSL version   0.9.8o             0.9.8o              0.9.8o              1.0.2, 1.0.1k        0.9.8o
    Kernel            Ubuntu 3.13.0-55   Oracle Solaris 11   Oracle Solaris 11   Oracle Solaris 11.2  Linux 2.6.32-504

[Figure 6: Optimization regions for UltraSPARC T2, plotted against granularity (bytes). The presence of a parameter in an optimization region indicates that it can provide at least 20% gains. The horizontal arrows indicate the cut-off granularities at which a parameter provides 20% gains.]

For example, optimizing acceleration starts providing benefits from 2KB, while optimizing overheads or computational index is beneficial up to 32KB. These arrows also indicate the cut-off granularity for each parameter. These cut-off granularities provide insights to architects and programmers about the design bottlenecks. For example, the high cut-off granularity of 32KB suggests high overheads and thus a potential for optimization.
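As an illustration of how such regions could be derived, the sketch below (ours, with hypothetical parameter values) sweeps granularity and records the range over which each parameter passes the 10x/20% test from the sensitivity analysis:

```python
import numpy as np

def speedup(g, L, o, C, A, beta):
    t_host = C * g ** beta
    return t_host / (o + L * g + t_host / A)

def optimization_region(param, base, gs, factor=10.0, gain=0.20):
    """Granularity range over which a 10x improvement in `param`
    yields at least a 20% speedup gain."""
    better = dict(base)
    better[param] = base[param] / factor if param in ("o", "L") else base[param] * factor
    hits = [g for g in gs if speedup(g, **better) >= (1.0 + gain) * speedup(g, **base)]
    return (min(hits), max(hits)) if hits else None

gs = np.logspace(4, 25, num=400, base=2.0)            # 16 B ... 32 MB
base = dict(L=0.5, o=1e6, C=100.0, A=50.0, beta=1.0)  # hypothetical values
for p in ("L", "o", "C", "A"):
    print(p, optimization_region(p, base, gs))        # start/end of each region
```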

4 EXPERIMENTAL METHODOLOGY

This section describes the experimental setup and benchmarks for validating LogCA on real machines. We also discuss our methodology for measuring the LogCA parameters and performance metrics.

Our experimental setup comprises on-chip and off-chip crypto accelerators (Table 2) and three different GPUs (Table 3). The on-chip crypto accelerators include the cryptographic units on Sun/Oracle UltraSPARC T2 [40], SPARC T3 [35], and SPARC T4 [41], and AES-NI (AES New Instruction) [15] on Sandy Bridge, whereas the off-chip accelerator is a Hifn 7955 chip connected through the PCIe bus [43]. The GPUs include a discrete NVIDIA GPU, an integrated AMD GPU (APU), and an HSA-supported integrated GPU.

For the on-chip crypto accelerators, each core in UltraSPARC T2 and SPARC T3 has a physically addressed crypto unit which requires privileged DMA calls. However, the crypto unit on SPARC T4 is integrated within the pipeline and does not require privileged DMA calls; SPARC T4 also provides non-privileged crypto instructions to access the crypto unit. Similar to SPARC T4, Sandy Bridge provides a non-privileged crypto instruction, AES-NI.

Considering the GPUs, the discrete GPU is connected through the PCIe bus, whereas for the APU the GPU is co-located with the host processor on the same die. For the APU, the system memory is partitioned between host and GPU memory. This eliminates the PCIe bottleneck of data copying, but it still requires copying data between memories. Unlike the discrete GPU and APU, the HSA-supported GPU provides a unified and coherent view of the system memory. Since the host and GPU share the same virtual address space, explicit copying of data between memories is not required.

Our workloads consist of encryption, hashing, and GPU kernels. For encryption and hashing, we have used the Advanced Encryption Standard (AES) [30] and the Secure Hashing Algorithm (SHA) [31], respectively, from OpenSSL [34], an open-source cryptography library. For GPU kernels, we use matrix multiplication, radix sort, FFT, and binary search from the AMD OpenCL SDK [1]. In Table 4 we list the complexities of each kernel, both in terms of the number of elements n and the granularity g. We expect these complexities to remain the same in both cases, but we observe that they differ for matrix multiplication. For example, for a square matrix of size n, matrix multiplication has a complexity of O(n³), whereas the complexity in terms of granularity is O(g^1.7). This happens because for matrix multiplication, unlike the others, computations are performed on matrices and not vectors: offloading a square matrix of size n corresponds to offloading n² elements, and since n³ = (n²)^1.5, the complexity in g is O(g^1.5), which explains the apparent discrepancy (the measured exponent, 1.7, is close to this). We also observe that, for the granularity range of 16B to 32MB, β = 0.11 provides a close approximation for log(g).
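As a toy check of this n-versus-g relationship (our illustration, not the paper's code), fitting the exponent of GEMM's work against the offloaded-element count recovers the 1.5 exponent:

```python
import numpy as np

# For an n x n matrix, GEMM performs O(n^3) work while offloading g ~ n^2
# elements, so the complexity in granularity is O(g^1.5); the measured
# exponent reported above (1.7) is close to this.
n = np.array([64, 128, 256, 512, 1024], dtype=float)
work = n ** 3                 # GEMM operations
g = n ** 2                    # offloaded elements (granularity)
beta = np.polyfit(np.log(g), np.log(work), 1)[0]
print(beta)                   # -> 1.5
```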

Table 3: Description of the GPUs

Platform              Discrete GPU      Integrated APU    AMD HSA
Name                  Tesla C2070       Radeon HD 6550    Radeon R7
Architecture          Fermi             Beaver Creek      Kaveri
Cores                 16                5                 8
Compute Units         448               400               512
Clock Freq.           1.5 GHz           600 MHz           720 MHz
Peak FLOPS            1 T               480 G             856 G
Host Processor        Intel Xeon E5520  AMD A8-3850       AMD A10-7850K
Host Frequency (GHz)  2.27              2.9               1.7

For calculating execution times, we have used Linux utilities on the crypto accelerators, whereas for the GPUs we have used the NVIDIA and AMD OpenCL profilers to compute the setup, kernel, and data transfer times; we report the average of one hundred executions. For verifying the usage of the crypto accelerators, we use built-in counters in UltraSPARC T2 and T3 [46].


Table 4: Algorithmic complexity of various kernels in terms of the number of elements and granularity. The power of g represents β for each kernel.

Kernel                               Complexity in n   Complexity in g
Advanced Encryption Standard (AES)   O(n)              O(g^1.01)
Secure Hashing Algorithm (SHA)       O(n)              O(g^0.97)
Matrix Multiplication (GEMM)         O(n^3)            O(g^1.7)
Fast Fourier Transform (FFT)         O(n log n)        O(g^1.2)
Radix Sort                           O(kn)             O(g^0.94)
Binary Search                        O(log n)          O(g^0.14)

Table 5: Calculated values of the LogCA parameters

Device           Benchmark       L (cycles)   o (cycles)    C (cycles/B)   A
Discrete GPU     AES                                        174
                 Radix Sort                                 290
                 GEMM            3×10^3       2×10^8        2              30
                 FFT                                        290
                 Binary Search                              116
APU              AES                                        174
                 Radix Sort                                 290
                 GEMM            15           4×10^8        2              7
                 FFT                                        290
                 Binary Search                              116
UltraSPARC T2    AES             1500         2.9×10^4      90             19
                 SHA                          1.05×10^3     72             12
SPARC T3         AES             1500         2.7×10^4      90             12
                 SHA                          1.05×10^3     72             10
SPARC T4         AES             500          435           32             12
                 SHA                          1.6×10^3      32             10
SPARC T4 instr.  AES             4            111           32             12
                 SHA                          1638          32             10
Sandy Bridge     AES             3            10            35             6

(Blank cells share the value listed once per device in that column, as in the original multirow table.)

SPARC T4, however, no longer supports these counters, so we use Linux utilities to trace the execution of the crypto instructions [3]. We use these execution times to determine the LogCA parameters. We calculate these parameters once per system; they can later be reused for different kernels on the same system.

For the computational index and β, we profile the CPU code on the host by varying the granularity from 16B to 32MB. At each granularity, we measure the execution time and use regression analysis to determine C and β. For overheads, we use the observation that for very small granularities the execution time for a kernel on an accelerator is dominated by the overheads, i.e., lim_{g→0} T1(g) ≃ o.

For acceleration, we use different methods for the on-chip accelerators and the GPUs. For on-chip accelerators, we calculate acceleration using equation (3) and the observation that the speedup curve flattens out and approaches the acceleration for very large granularities. However, for the GPUs we do not use equation (3), as it would require computing the acceleration for each kernel: each application has a different access pattern, which affects the speedup. Instead, we bound the maximum performance using the peak FLOPS from the device specifications and use the ratio of peak GFLOPS on the GPU and CPU, i.e.,

    A = Peak GFLOPS_GPU / Peak GFLOPS_CPU

Similar to acceleration, we use two different techniques for calculating latency. For the on-chip accelerators, we run micro-benchmarks and use the execution times on the host and the accelerators. For the GPUs, we compute latency using the peak memory bandwidth of the GPU. Similar to Meswani et al. [29], we use the following equation for measuring the data copying time for the GPUs:

    L = 1 / BW_peak
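A sketch of this fitting procedure (ours, on hypothetical timing data) in Python:

```python
import numpy as np

# Hypothetical host execution times t(g) over granularities g.
g = np.array([16, 64, 256, 1024, 4096, 16384], dtype=float)   # bytes
t = np.array([2.9e3, 1.2e4, 4.8e4, 1.9e5, 7.8e5, 3.1e6])      # cycles

# T_host(g) = C * g^beta  =>  log t = log C + beta * log g
beta, log_c = np.polyfit(np.log(g), np.log(t), 1)
print(f"beta ~ {beta:.2f}, C ~ {np.exp(log_c):.1f} cycles/B")

# Overheads: o ~ accelerator time at the smallest granularity (g -> 0).
# GPU-side acceleration and latency from device specs (illustrative numbers):
A = 1000e9 / 36e9        # peak GFLOPS ratio, GPU / CPU
L = 1.0 / 144e9          # time per byte from peak memory bandwidth (~144 GB/s)
```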

Earlier, we developed our model using the assumptions of granularity-independent and granularity-dependent latencies. In our setup, we observe that the on-chip crypto accelerators and the HSA-enabled GPU represent accelerators with granularity-independent latency, while the off-chip crypto accelerator and the discrete GPU/APU represent granularity-dependent accelerators. For each accelerator, we calculate the speedup and performance metrics using the respective equations (§2).

5 EVALUATION

In this section, we show that LogCA closely captures the behavior of both off-chip and on-chip accelerators. We also list the calculated LogCA parameters in Table 5. To demonstrate the utility of our model, we also present two case studies, considering the evolution of the interface in Sun/Oracle's crypto accelerators and in three different GPU architectures. In both cases, we elaborate on the design changes using the insights LogCA provides.

5.1 Linear-Complexity Kernels (β = 1)

Figure 7 shows the curve fitting of LogCA for AES. We consider both off-chip and on-chip accelerators, connected through interfaces ranging from the PCIe bus to special instructions. We observe that the off-chip accelerators and the APU, unlike the on-chip accelerators, provide reasonable speedup only at very large granularities. We also observe that the achievable speedup is limited by the computational intensity for the off-chip accelerators and by the acceleration for the on-chip accelerators. This observation supports our earlier implications on the limits of speedup for granularity-independent and granularity-dependent latencies in equations (3) and (9), respectively.

Figure 7 also shows that UltraSPARC T2 provides higher speedups than Sandy Bridge, but breaks even at a larger granularity. Sandy Bridge, on the other hand, breaks even at a very small granularity but provides limited speedup. The discrete GPU, with powerful processing cores, has the highest acceleration among them. However, its observed speedup is lower than the others' due to the high overheads and latencies involved in communicating through the PCIe bus.

We have also marked g1 and gA/2 for each accelerator in Figure 7, which help programmers and architects identify the complexity of the interface. For example, g1 for the crypto instructions, i.e., SPARC T4 and Sandy Bridge, lies on the extreme left, while for the off-chip accelerators g1 lies on the far right. It is worth mentioning that we have marked gA/2 for the on-chip accelerators but not for the off-chip accelerators: for the off-chip accelerators, the computational intensity is less than the acceleration, and as we have noted in equation (12), gA/2 for these designs does not exist. We also observe that g1 for the crypto card connected through the PCIe bus does not exist, showing that this accelerator does not break even, even for large granularities. Figure 7 also shows that g1 for the GPU and APU is comparable. This observation shows that, despite being an integrated GPU that is not connected to the PCIe bus, the APU spends considerable time copying data from the host to the device memory.
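Under granularity-dependent latency, these markers follow directly from equations (11) and (12); a small sketch (ours, with hypothetical parameter values) makes the existence conditions explicit:

```python
def g_1(L, o, C, A, beta):
    """Break-even granularity, equation (11); no break-even exists
    if the denominator is non-positive (C/L too small relative to A)."""
    den = C * beta * (A - 1) - A * L
    return (C * (beta - 1) * (A - 1) + A * o) / den if den > 0 else float("inf")

def g_half_A(L, o, C, A, beta):
    """Granularity for a speedup of A/2, equation (12);
    requires C/L > A/beta to exist."""
    den = C * beta - A * L
    return (C * (beta - 1) + A * o) / den if den > 0 else float("inf")

params = dict(L=0.5, o=1e6, C=100.0, beta=1.0)   # hypothetical values
print(g_1(A=50.0, **params))                     # ~1.0e4 B
print(g_half_A(A=50.0, **params))                # ~6.7e5 B
print(g_half_A(A=100.0, **params))               # larger A pushes g_A/2 out (cf. equation (7))
```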

        LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

[Figure 7: Speedup curve-fitting plots comparing LogCA with the observed values of AES [30], with granularity (bytes) on the x-axis and speedup on the y-axis. Panels: (a) PCIe crypto, (b) NVIDIA discrete GPU, (c) AMD integrated GPU (APU), (d) UltraSPARC T2, (e) SPARC T3, (f) SPARC T4 engine, (g) SPARC T4 instruction, (h) AES-NI on Sandy Bridge. Each plot marks A, C/L, and, where they exist, g1 and gA/2.]

[Figure 8: Speedup curve-fitting plots comparing LogCA with the observed values of SHA256 [31] on (a) the UltraSPARC T2 engine, (b) the SPARC T3 engine, (c) the SPARC T4 engine, and (d) the SPARC T4 instruction. LogCA starts following the observed values after 64B.]

Figure 8 shows the curve fitting for SHA on various on-chip crypto accelerators. We observe that g1 and gA/2 do exist, as all of these are on-chip accelerators. We also observe that the LogCA curve mostly follows the observed values. However, it deviates from the observed values below 64B. This happens because SHA requires a block size of 64B for hash computation; if the block size is less than 64B, it pads extra bits to reach 64B. Since LogCA does not capture this effect, it does not follow the observed speedup for granularities smaller than 64B.

Figure 9-a shows the speedup curve-fitting plots for radix sort. We observe that LogCA does not follow the observed values for smaller granularities on the GPU. Despite this inaccuracy, LogCA accurately predicts g1 and gA/2. We also observe that gA/2 for the GPU is higher than for the APU, and this observation supports equation (7): increasing acceleration increases gA/2.

5.2 Super-Linear Complexity Kernels (β > 1)

Figures 9-b and 9-c show the speedup curve-fitting plots for super-linear complexity kernels on the discrete GPU and APU. We observe that matrix multiplication, with higher complexity (O(g^1.7)), achieves a higher speedup than sort and FFT, with their lower complexities of O(g) and O(g^1.2), respectively. This observation corroborates the result from equation (9) that the achievable speedup of higher-complexity algorithms is higher than that of lower-complexity algorithms. We also observe that gA/2 does not exist for FFT. This happens because, as we note in equation (12), for gA/2 to exist for FFT, C/L should be greater than A/1.2. However, Figure 9-c shows that C/L is smaller than A/1.2 for both the GPU and the APU.

5.3 Sub-Linear Complexity Kernels (β < 1)

Figure 9-d shows the curve fitting for binary search, which is a sub-linear algorithm (β = 0.14). We make three observations. First, g1 does not exist even for very large granularities, and C/L < 1. This observation supports implication (5): for a sub-linear algorithm with β = 0.14, C/L should be greater than 7 to provide any speedup.

[Figure 9: Speedup curve-fitting plots comparing LogCA with the observed values of (a) radix sort, (b) matrix multiplication, (c) FFT, and (d) binary search, each on the GPU and APU. Each plot shows granularity (bytes) versus speedup and marks A, C/L, and, where they exist, g1 and gA/2.]

Second, for large granularities, the speedup starts decreasing with an increase in granularity. This observation supports our earlier claim in implication (4) that for systems with granularity-dependent latencies, the speedup for sub-linear algorithms asymptotically decreases. Third, LogCA deviates from the observed values at large granularities. This deviation occurs because LogCA does not model caches. As mentioned earlier, LogCA abstracts the caches and memories with a single latency parameter, which does not capture the memory-access pattern accurately. Even though LogCA does not accurately capture binary search's behavior, it still provides an upper bound on the achievable performance.
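The asymptotic decrease follows directly from the model: for β < 1, the linear latency term eventually dominates the denominator of the speedup expression. A short derivation (ours, from the speedup equation under granularity-dependent latency, consistent with equation (9)):

```latex
\mathrm{Speedup}(g) \;=\; \frac{C\,g^{\beta}}{o + L\,g + C\,g^{\beta}/A}
\;\approx\; \frac{C\,g^{\beta}}{L\,g} \;=\; \frac{C}{L}\,g^{\beta-1}
\;\xrightarrow{\,g\to\infty\,}\; 0 \qquad (\beta < 1)
```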

5.4 Case Studies

Figure 10 shows the evolution of crypto accelerators in SPARC architectures, from the off-chip accelerators in pre-Niagara (Figure 10 (a)) to the accelerators integrated within the pipeline in SPARC T4 (Figure 10 (e)). We observe that latency is absent in the on-chip accelerators' optimization regions, as these accelerators are closely coupled with the host. We also note that the optimization region with overheads, representing the complexity of an accelerator's interface, shrinks, while the optimization regions with acceleration expand from Figure 10 (a) to (e). For example, for the off-chip crypto accelerator the cut-off granularity for overheads is 256KB, whereas it is 128B for SPARC T4, suggesting a much simpler interface.

Figure 10 (a) shows the optimization regions for the off-chip crypto accelerator connected through the PCIe bus. We note that overheads and latencies occupy most of the optimization regions, indicating high-overhead OS calls and high-latency data copying over the PCIe bus as the bottlenecks.

Figure 10 (b) shows the optimization regions for UltraSPARC T2. The large cut-off granularity for overheads at 32KB suggests a complex interface, indicating that high-overhead OS calls create a bottleneck at small granularities. The cut-off granularity of 2KB for acceleration suggests that optimizing acceleration is beneficial at large granularities.

Figure 10 (d) shows the optimization regions for the on-chip accelerator on SPARC T4. There are three optimization regions, with the cut-off granularity for overheads now reduced to only 512B. This observation suggests a considerable improvement in the interface design over SPARC T3, which is also evident from a smaller g1. We also note that the cut-off granularity for acceleration now decreases to 32B, showing an increase in the opportunity for optimizing acceleration.

Figure 10 (e) shows the optimization regions for the crypto instructions on SPARC T4. We observe that, unlike the earlier designs, it has only two optimization regions, and the speedup approaches the peak acceleration at a small granularity of 128B. In contrast, UltraSPARC T2 and SPARC T3 do not provide any gains at this granularity. We also observe that the cut-off granularity for overheads further reduces to 128B, suggesting some opportunity for optimization at very small granularities. The model also shows that acceleration occupies the widest range for optimization; for example, optimizing acceleration provides benefits for granularities greater than 16B. The low-overhead access which LogCA shows is due to the non-privileged instruction SPARC T4 uses to access the cryptographic unit, which is integrated within the pipeline.

Figure 11 shows the evolution of memory interface design in GPU architectures. It shows the optimization regions for matrix multiplication on a discrete NVIDIA GPU, an AMD integrated GPU (APU), and an integrated AMD GPU with HSA support. We observe that matrix multiplication for all three architectures is compute-bound (§3.1).

[Figure 10: LogCA optimization regions for performing the Advanced Encryption Standard on various crypto accelerators: (a) PCIe crypto accelerator, (b) UltraSPARC T2, (c) SPARC T3, (d) SPARC T4 engine, (e) SPARC T4 instruction. Curves show the LogCA baseline and 10x variations of each parameter (L/10, o/10, 10C, 10A) over granularity (bytes). LogCA identifies the design bottlenecks through the LogCA parameters in each optimization region; the bottleneck LogCA exposes in each design is optimized in the next design.]

[Figure 11: Optimization regions for matrix multiplication over a range of granularities on (a) an NVIDIA discrete GPU, (b) an AMD APU, and (c) an HSA-supported AMD integrated GPU. Curves show the LogCA baseline and 10x variations of each parameter (L/10, o/10, 10C, 10A).]

We also observe that the computational index occupies most of the regions, which signifies maximum optimization potential.

The discrete GPU has four optimization regions (Figure 11 (a)). Among these, latency dominates most of the regions, signifying high-latency data copying over the PCIe bus and thus maximum optimization potential. The high cut-off granularity for overheads at 32KB indicates high-overhead OS calls to access the GPU. Similarly, with highly aggressive cores, acceleration has a high cut-off granularity of 256KB, indicating less optimization potential for acceleration.

Similar to the discrete GPU, the APU also has four optimization regions (Figure 11 (b)). There are a few notable differences compared to the discrete GPU: the cut-off granularity for latency reduces to 512KB with the elimination of data copying over the PCIe bus; the overheads are still high, suggesting high-overhead OS calls to access the APU; and, with less aggressive cores, the cut-off granularity for acceleration reduces to 64KB, implying more optimization potential for acceleration.

Figure 11 (c) shows three optimization regions for the HSA-enabled integrated GPU. We observe that latency is absent in all regions and that the cut-off granularity for overheads reduces to 8KB. These reductions in overheads and latencies signify a simpler interface compared to the discrete GPU and APU. We also observe that the cut-off granularity for acceleration drops to 2KB, suggesting higher potential for optimizing acceleration.

6 RELATED WORK

We compare and contrast our work with prior approaches. Lopez-Novoa et al. [28] provide a detailed survey of various accelerator modeling techniques. We broadly classify these techniques into two categories and discuss the most relevant work.

Analytical Models. There is a rich body of work exploring analytical models for the performance prediction of accelerators. For some models, the motivation is to determine future trends in heterogeneous architectures. Chung et al. [7], in a detailed study, predict the future landscape of heterogeneous computing. Hempstead et al. [17] propose an early-stage model, Navigo, that determines the fraction of area required for accelerators to maintain the traditional performance trend. Nilakantan et al. [32] propose to incorporate communication cost in early-stage models of accelerator-rich architectures. For others, the motivation is to determine the right amount of data to offload and the potential benefits associated with an accelerator [24].

Some analytical models are architecture-specific. For example, a number of studies [20, 21, 44, 57] predict the performance of GPU architectures. Hong et al. [20] present an analytical performance model for predicting execution time on GPUs. They later extend their model and develop an integrated power and performance model for GPUs [21]. Song et al. [44] use a simple counter-based approach to predict power and performance. Meswani et al. [29] explore such models for high-performance applications. Daga et al. [11] analyze the effectiveness of accelerated processing units (APUs) over GPUs and describe the communication cost over the PCIe bus as a major bottleneck in exploiting the full potential of GPUs.

In general, our work differs from these studies in complexity. These models use a large number of parameters to accurately predict power and/or performance, whereas we limit the number of parameters to reduce the complexity of our model. They also require a deep understanding of the underlying architecture, and most of them require access to GPU-specific assembly or PTX code. Unlike these approaches, we use the CPU code to provide bounds on the performance.

Roofline Models. In terms of simplicity and motivation, our work closely matches the Roofline model [54], a visual performance model for multi-core architectures. Roofline exposes bottlenecks for a kernel and suggests several optimizations which programmers can use to fine-tune the kernel on a given system.

A number of extensions of Roofline have been proposed [10, 22, 33, 56], and some of these extensions are architecture-specific, for example, targeting GPUs [22], vector processors [39], and FPGAs [10, 56].

Despite the similarities, Roofline and its extensions cannot be used for exposing design bottlenecks in an accelerator's interface. The primary goal of Roofline models has been to help programmers and compiler writers, while LogCA provides more insights for architects.

7 CONCLUSION AND FUTURE WORK

With the recent trend towards heterogeneous computing, we feel that the architecture community lacks a model to reason about the need for accelerators. In this respect, we propose LogCA, an insightful visual performance model for hardware accelerators. LogCA provides insights, early in the design stage, to both architects and programmers: it identifies performance bounds, exposes interface design bottlenecks, and suggests optimizations to alleviate these bottlenecks. We have validated our model across a range of on-chip and off-chip accelerators and have shown its utility using retrospective studies describing the evolution of the accelerators' interfaces in these architectures.

The applicability of LogCA can be limited by our simplifying assumptions, and for more realistic analysis we plan to overcome these limitations in future work. For example, we assume a single-accelerator system and do not explicitly model contention among resources; our model should be extended to handle multi-accelerator and pipelined scenarios. For fixed-function accelerators, our design space is currently limited to encryption and hashing kernels; to overcome this, we are expanding our design space with the compression and database accelerators in the Oracle M7 processor. We also plan to complement LogCA with an energy model, as energy efficiency is a prime design metric for accelerators.

ACKNOWLEDGEMENTS

We thank our anonymous reviewers, Arkaprava Basu, and Tony Nowatzki for their insightful comments and feedback on the paper. Thanks to Mark Hill, Michael Swift, the Wisconsin Computer Architecture Affiliates, and other members of the Multifacet group for their valuable discussions. We also thank Brian Wilson at University of Wisconsin DoIT and Eric Sedlar at Oracle Labs for providing access to SPARC T3 and T4 servers, respectively. Also thanks to Muhammad Umair Bin Altaf for his help in the formulation. This work is supported in part by the National Science Foundation (CNS-1302260, CCF-1438992, CCF-1533885, CCF-1617824), Google, and the University of Wisconsin-Madison (Amar and Balindar Sohi Professorship in Computer Science). Wood has a significant financial interest in AMD and Google.

REFERENCES

[1] Advanced Micro Devices. 2016. APP SDK - A Complete Development Platform. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk
[2] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS '67 (Spring)), 483. https://doi.org/10.1145/1465482.1465560
[3] Dan Anderson. 2012. How to tell if SPARC T4 crypto is being used. https://blogs.oracle.com/DanX/entry/how_to_tell_if_sparc
[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. 2009. A View of the Parallel Computing Landscape. Commun. ACM 52, 10 (Oct. 2009), 56-67. https://doi.org/10.1145/1562764.1562783
[5] Nathan Beckmann and Daniel Sanchez. 2016. Cache Calculus: Modeling Caches through Differential Equations. IEEE Computer Architecture Letters PP, 99 (2016), 1. https://doi.org/10.1109/LCA.2015.2512873
[6] C. Cascaval, S. Chatterjee, H. Franke, K. J. Gildea, and P. Pattnaik. 2010. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development 54 (2010), 5:1-5:10. https://doi.org/10.1147/JRD.2010.2059721
[7] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. 2010. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (Dec. 2010), 225-236. https://doi.org/10.1109/MICRO.2010.36
[8] Jason Cong, Zhenman Fang, Michael Gill, and Glenn Reinman. 2015. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. In 2015 IEEE/ACM International Conference on Computer-Aided Design, Austin, TX.
[9] D. Culler, R. Karp, D. Patterson, and A. Sahay. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1-12. http://dl.acm.org/citation.cfm?id=155333
[10] Bruno da Silva, An Braeken, Erik H. D'Hollander, and Abdellah Touhafi. 2013. Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools. Int. J. Reconfig. Comput. 2013 (Jan. 2013). https://doi.org/10.1155/2013/428078
[11] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. 2011. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing, 141-149. https://doi.org/10.1109/SAAHPC.2011.29
[12] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA '11). ACM, New York, NY, USA, 365. https://doi.org/10.1145/2000064.2000108
[13] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. 2010. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development 54 (2010), 3:1-3:11. https://doi.org/10.1147/JRD.2009.2036980
[14] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings of the International Symposium on High-Performance Computer Architecture, 503-514. https://doi.org/10.1109/HPCA.2011.5749755
[15] Shay Gueron. 2012. Intel Advanced Encryption Standard (AES) Instructions Set. Technical Report. Intel Corporation. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[16] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10), 37. https://doi.org/10.1145/1815961.1815968
[17] Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. Navigo: An early-stage model to study power-constrained architectures and specialization. In ISCA Workshop on Modeling, Benchmarking, and Simulations (MoBS).
[18] John L. Hennessy and David A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach. 704 pages.
[19] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (July 2008), 33-38. https://doi.org/10.1109/MC.2008.209
[20] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, Vol. 37, 152-163. https://doi.org/10.1145/1555815.1555775
[21] Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, Vol. 38, 280-289. https://doi.org/10.1145/1816038.1815998
[22] Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par '12). Springer-Verlag, Berlin, Heidelberg, 920-932. https://doi.org/10.1007/978-3-642-32820-6_90
[23] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 468-479. https://doi.org/10.1145/2540708.2540748
[24] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. 2013. A survey of computation offloading for mobile systems. Mobile Networks and Applications 18 (2013), 129-140. https://doi.org/10.1007/s11036-012-0368-0
[25] Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 361-372. https://doi.org/10.1145/2751205.2751231
[26] Maysam Lavasani, Hari Angepat, and Derek Chiou. 2014. An FPGA-based In-Line Accelerator for Memcached. IEEE Computer Architecture Letters 13, 2 (July 2014), 57-60. https://doi.org/10.1109/L-CA.2013.17
[27] John D. C. Little and Stephen C. Graves. 2008. Little's law. In Building Intuition. Springer, 81-100.
[28] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso. 2015. A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. IEEE Transactions on Parallel and Distributed Systems 26, 1 (Jan. 2015), 272-281. https://doi.org/10.1109/TPDS.2014.2308216
[29] M. R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole. 2013. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications 27 (2013), 89-108. https://doi.org/10.1177/1094342012468180
[30] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). https://doi.org/10.6028/NIST.FIPS.197
[31] National Institute of Standards and Technology. 2008. Secure Hash Standard. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
[32] S. Nilakantan, S. Battle, and M. Hempstead. 2013. Metrics for Early-Stage Modeling of Many-Accelerator Architectures. IEEE Computer Architecture Letters 12, 1 (Jan. 2013), 25-28. https://doi.org/10.1109/L-CA.2012.9
[33] Cedric Nugteren and Henk Corporaal. 2012. The Boat Hull Model: Enabling Performance Prediction for Parallel Computing Prior to Code Development. In Proceedings of the 9th Conference on Computing Frontiers. ACM, 203-212.
[34] OpenSSL Software Foundation. 2015. OpenSSL: Cryptography and SSL/TLS Toolkit. https://openssl.org
[35] Sanjay Patel. 2009. Sun's Next-Generation Multithreaded Processor: Rainbow Falls. In 21st Hot Chips Symposium. http://www.hotchips.org/wp-content/uploads/hc
[36] Sanjay Patel and Wen-mei W. Hwu. 2008. Accelerator Architectures. IEEE Micro 28, 4 (July 2008), 4-12. https://doi.org/10.1109/MM.2008.50
[37] Stephen Phillips. 2014. M7: Next Generation SPARC. In 26th Hot Chips Symposium.
[38] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.
[39] Yoshiei Sato, Ryuichi Nagaoka, Akihiro Musa, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Hiroaki Kobayashi. 2009. Performance tuning and analysis of future vector processors based on the roofline model. In Proceedings of the 10th MEDEA Workshop on MEmory performance: DEaling with Applications, systems and architecture (MEDEA '09), 7. https://doi.org/10.1145/1621960.1621962
[40] M. Shah, J. Barren, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. 2007. UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC. In IEEE Asian Solid-State Circuits Conference (A-SSCC '07), 22-25. https://doi.org/10.1109/ASSCC.2007.4425786
[41] Manish Shah, Robert Golla, Gregory Grohoski, Paul Jordan, Jama Barreh, Jeffrey Brooks, Mark Greenberg, Gideon Levinsky, Mark Luttrell, Christopher Olson, Zeid Samoail, Matt Smittle, and Thomas Ziaja. 2012. SPARC T4: A dynamically threaded server-on-a-chip. IEEE Micro 32 (2012), 8-19. https://doi.org/10.1109/MM.2012.1
[42] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In International Symposium on Computer Architecture (ISCA).
[43] Soekris Engineering. 2016. vpn1401, for Std. PCI-sockets. http://soekris.com/products/vpn-1401.html
[44] Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In Proceedings of the IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS 2013), 673-686. https://doi.org/10.1109/IPDPS.2013.73
[45] Jeff Stuecheli. 2013. POWER8. In 25th Hot Chips Symposium.
[46] Ning Sun and Chi-Chang Lin. 2007. Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors. Technical Report. http://www.oracle.com/technetwork/server-storage/solaris/documentation/819-5782-150147.pdf
[47] S. Tabik, G. Ortega, and E. M. Garzón. 2014. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. The Journal of Supercomputing 70, 2 (Nov. 2014), 577-587. https://doi.org/10.1007/s11227-014-1102-4
[48] Y. C. Tay. 2013. Analytical Performance Modeling for Computer Systems (2nd ed.). Morgan & Claypool Publishers.
[49] M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In 49th ACM/EDAC/IEEE Design Automation Conference (DAC 2012), 1131-1136. https://doi.org/10.1145/2228360.2228567
[50] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 205-218. https://doi.org/10.1145/1736020.1736044
[51] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44), 163. https://doi.org/10.1145/2155620.2155640
[52] Guibin Wang, Yisong Lin, and Wei Yi. 2010. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In 2010 IEEE/ACM Int'l Conference on Green Computing and Communications (GreenCom) and Int'l Conference on Cyber, Physical and Social Computing (CPSCom), 344-350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102
[53] Eric W. Weisstein. 2015. Newton's Method. From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html
[54] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52 (2009), 65-76. https://doi.org/10.1145/1498765.1498785
[55] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 255-268. https://doi.org/10.1145/2541940.2541961
[56] Moein Pahlavan Yali. 2014. FPGA-Roofline: An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems. Ph.D. Dissertation. Virginia Polytechnic Institute and State University.
[57] Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings of the International Symposium on High-Performance Computer Architecture, 382-393. https://doi.org/10.1109/HPCA.2011.5749745



          32 Sensitivity AnalysisTo identify the design bottlenecks we perform a sensitivity analysisof the LogCA parameters We consider a parameter a design bottle-neck if a 10x improvement in it provides at lest 20 improvement inspeedup A lsquobottleneckedrsquo parameter also provides an optimizationopportunity To visually identify these bottlenecks we introduceoptimization regions As an example we identify design bottlenecksin UltraSPARC T2rsquos crypto accelerator by varying its individualparameters 2 in Figure 5 (a)-(d)

          2We elaborate our methodology for measuring LogCA parameters later (sect 4)

          Figure 5 (a) shows the variation (or the lack of) in speedup withthe decrease in latency The resulting gains are negligible and inde-pendent of the granularity as it is a closely coupled accelerator

          Figure 5 (b) shows the resulting speedup after reducing overheadsSince the overheads are one-time initialization cost and independentof granularity the per byte setup cost is high at small granularitiesDecreasing these overheads considerably reduces the per byte setupcost and results in significant gains at these smaller granularitiesConversely for larger granularities the per byte setup cost is alreadyamortized so reducing overheads does not provide much gainsThus overhead is a bottleneck at small granularities and provide anopportunity for optimization

          Figure 5 (c) shows the effect of increasing the computationalindex The results are similar to optimizing overheads in Figure 5 (b)ie significant gains for small granularities and a gradual decreasein the gains with increasing granularity With the constant overheadsincreasing computational index increases the computation time of thekernel and decreases the per byte setup cost For smaller granularitiesthe reduced per byte setup cost results in significant gains

          Figure 5 (d) shows the variation in speedup with increasing peakacceleration The gains are negligible at small granularities andbecome significant for large granularities As mentioned earlierthe per byte setup cost is high at small granularities and it reducesfor large granularities Since increasing peak acceleration does notreduce the per byte setup cost optimizing peak acceleration providesgains only at large granularities

          We group these individual sensitivity plots in Figure 6 to buildthe optimization regions As mentioned earlier each region indicatesthe potential of 20 gains with 10x variation of one or more LogCAparameters For the ease of understanding we color these regionsand label them with their respective LogCA parameters For exam-ple the blue colored region labelled lsquooCrsquo (16B to 2KB) indicatesan optimization region where optimizing overheads and computa-tional index is beneficial Similarly the red colored region labelledlsquoArsquo (32KB to 32MB) represents an optimization region where opti-mizing peak acceleration is only beneficial The granularity rangeoccupied by a parameter also identifies the scope of optimizationfor an architect and a programmer For example for UltraSPARCT2 overheads occupy most of the lower granularity suggesting op-portunity for improving the interface Similarly the absence of thelatency parameter suggests little benefits for optimizing latency

          We also add horizontal arrows to the optimization regions inFigure 6 to demarcate the start and end of granularity range for each


Table 2: Description of the cryptographic accelerators.

                     PCI Crypto        UltraSPARC T2      SPARC T3           SPARC T4             Sandy Bridge
    Processor        AMD A8-3850       S2                 S2                 S3                   Intel Core i7-2600
    Frequency        2.9 GHz           1.16 GHz           1.65 GHz           3 GHz                3.4 GHz
    OpenSSL version  0.9.8o            0.9.8o             0.9.8o             1.0.2, 1.0.1k        0.9.8o
    OS / Kernel      Ubuntu 3.13.0-55  Oracle Solaris 11  Oracle Solaris 11  Oracle Solaris 11.2  Linux 2.6.32-504

[Figure 6: Optimization regions for UltraSPARC T2, plotting speedup against granularity (bytes) for LogCA with L 1/10x, o 1/10x, C 10x, and A 10x. The presence of a parameter in an optimization region indicates that it can provide at least 20% gains. The horizontal arrows indicate the cut-off granularity at which a parameter provides 20% gains.]


4 EXPERIMENTAL METHODOLOGY

This section describes the experimental setup and benchmarks for validating LogCA on real machines. We also discuss our methodology for measuring the LogCA parameters and performance metrics.

Our experimental setup comprises on-chip and off-chip crypto accelerators (Table 2) and three different GPUs (Table 3). The on-chip crypto accelerators include the cryptographic units on Sun/Oracle UltraSPARC T2 [40], SPARC T3 [35], and SPARC T4 [41], and AES-NI (AES New Instructions) [15] on Sandy Bridge, whereas the off-chip accelerator is a Hifn 7955 chip connected through the PCIe bus [43]. The GPUs include a discrete NVIDIA GPU, an integrated AMD GPU (APU), and an HSA-supported integrated GPU.

For the on-chip crypto accelerators, each core in UltraSPARC T2 and SPARC T3 has a physically addressed crypto unit which requires privileged DMA calls. However, the crypto unit on SPARC T4 is integrated within the pipeline and does not require privileged DMA calls. SPARC T4 also provides non-privileged crypto instructions to access the crypto unit. Similar to SPARC T4, Sandy Bridge provides non-privileged crypto instructions, AES-NI.

Considering the GPUs, the discrete GPU is connected through the PCIe bus, whereas for the APU, the GPU is co-located with the host processor on the same die. For the APU, the system memory is partitioned between host and GPU memory. This eliminates the PCIe bottleneck of data copying, but it still requires copying data between memories. Unlike the discrete GPU and APU, the HSA-supported GPU provides a unified and coherent view of the system memory. Since the host and GPU share the same virtual address space, explicit copying of data between memories is not required.

Our workloads consist of encryption, hashing, and GPU kernels. For encryption and hashing, we have used the Advanced Encryption Standard (AES) [30] and the Secure Hashing Algorithm (SHA) [31], respectively, from OpenSSL [34], an open-source cryptography library. For GPU kernels, we use matrix multiplication, radix sort, FFT, and binary search from the AMD OpenCL SDK [1]. In Table 4 we list the complexities of each kernel, both in terms of the number of elements n and the granularity g. We expect these complexities to remain the same in both cases, but we observe that they differ for matrix multiplication. For example, for a square matrix of size n, matrix multiplication has a complexity of O(n^3), whereas the complexity in terms of granularity is O(g^1.7). This happens because for matrix multiplication, unlike the others, computations are performed on matrices and not vectors. So offloading a square matrix of size n corresponds to offloading n^2 elements, which results in the apparent discrepancy in the complexities. We also observe that, for the granularity range of 16B to 32MB, β = 0.11 provides a close approximation for log(g).
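This conversion can be sanity-checked with a short derivation; assuming s-byte elements (s is an illustrative stand-in for the element size), offloading an n × n matrix means g = s·n², so

\[
  n = (g/s)^{1/2}
  \quad\Longrightarrow\quad
  O(n^3) = O\!\left((g/s)^{3/2}\right) = O\!\left(g^{1.5}\right),
\]

which is of the same order as the fitted O(g^1.7); the regression folds constant factors and measurement effects into the exponent.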

Table 3: Description of the GPUs.

                       Discrete GPU      Integrated APU    AMD HSA
    Name               Tesla C2070       Radeon HD 6550    Radeon R7
    Architecture       Fermi             Beaver Creek      Kaveri
    Cores              16                5                 8
    Compute Units      448               400               512
    Clock Freq.        1.5 GHz           600 MHz           720 MHz
    Peak FLOPS         1 T               480 G             856 G
    Host Processor     Intel Xeon E5520  AMD A8-3850       AMD A10-7850K
    Host Freq. (GHz)   2.27              2.9               1.7

For calculating execution times, we have used Linux utilities on the crypto accelerators, whereas for the GPUs we have used the NVIDIA and AMD OpenCL profilers to compute the setup, kernel, and data transfer times, and we report the average of one hundred executions. For verifying the usage of the crypto accelerators, we use built-in counters in UltraSPARC T2 and T3 [46]. SPARC T4, however, no longer supports these counters, so we use Linux utilities to trace the execution of the crypto instructions [3].


Table 4: Algorithmic complexity of various kernels in terms of the number of elements (n) and granularity (g). The power of g represents β for each kernel.

    Kernel                               Complexity in n   Complexity in g
    Advanced Encryption Standard (AES)   O(n)              O(g^1.01)
    Secure Hashing Algorithm (SHA)       O(n)              O(g^0.97)
    Matrix Multiplication (GEMM)         O(n^3)            O(g^1.7)
    Fast Fourier Transform (FFT)         O(n log n)        O(g^1.2)
    Radix Sort                           O(kn)             O(g^0.94)
    Binary Search                        O(log n)          O(g^0.14)

Table 5: Calculated values of the LogCA parameters.

    Device           Benchmark       L (cycles)   o (cycles)   C (cycles/B)   A
    Discrete GPU     AES             3×10^3       2×10^8       174            30
                     Radix Sort                                290
                     GEMM                                      2
                     FFT                                       290
                     Binary Search                             116
    APU              AES             15           4×10^8       174            7
                     Radix Sort                                290
                     GEMM                                      2
                     FFT                                       290
                     Binary Search                             116
    UltraSPARC T2    AES             1500         2.9×10^4     90             19
                     SHA                          1.05×10^3    72             12
    SPARC T3         AES             1500         2.7×10^4     90             12
                     SHA                          1.05×10^3    72             10
    SPARC T4         AES             500          435          32             12
                     SHA                          1.6×10^3     32             10
    SPARC T4 instr.  AES             4            111          32             12
                     SHA                          1638         32             10
    Sandy Bridge     AES             3            10           35             6

We use these execution times to determine the LogCA parameters. We calculate these parameters once per system; they can later be reused for different kernels on the same system.

For the computational index and β, we profile the CPU code on the host by varying the granularity from 16B to 32MB. At each granularity, we measure the execution time and use regression analysis to determine C and β. For overheads, we use the observation that for very small granularities the execution time for a kernel on an accelerator is dominated by the overheads, i.e., lim_{g→0} T_1(g) ≈ o.

For acceleration, we use different methods for the on-chip accelerators and the GPUs. For on-chip accelerators, we calculate acceleration using equation (3) and the observation that the speedup curve flattens out and approaches the acceleration for very large granularities. However, for the GPUs we do not use equation (3), as it would require computing acceleration for each kernel: each application has a different access pattern, which affects the speedup. Instead, we bound the maximum performance using the peak FLOPS from the device specifications and use the ratio of peak GFLOPS on the GPU and CPU, i.e.,

    A = Peak GFLOPS_GPU / Peak GFLOPS_CPU

Similar to acceleration, we use two different techniques for calculating latency. For the on-chip accelerators, we run micro-benchmarks and use the execution times on the host and the accelerators. On the other hand, for the GPUs we compute latency using the peak memory bandwidth of the GPU. Similar to Meswani et al. [29], we use the following equation for measuring the per-byte data copying time for the GPUs:

    L = 1 / BW_peak
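A small sketch of this fitting procedure, with synthetic stand-ins for the measured times (the real inputs come from the profilers and utilities described above):

    # Fit C and beta by least squares in log space, and estimate o from
    # the smallest-granularity run on the accelerator.
    import numpy as np

    g = np.array([2.0 ** e for e in range(4, 26)])   # 16B ... 32MB
    t_host = 90.0 * g ** 1.01        # stand-in for measured host times (cycles)
    t_acc_smallest = 2.9e4           # stand-in accelerator time at g = 16B

    beta, logC = np.polyfit(np.log(g), np.log(t_host), 1)  # log T = beta*log g + log C
    C = np.exp(logC)
    o = t_acc_smallest               # lim_{g -> 0} T1(g) ~ o

    print(f"C = {C:.1f} cycles/B, beta = {beta:.2f}, o = {o:.0f} cycles")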

Earlier, we developed our model under assumptions of granularity independent and granularity dependent latencies. In our setup, we observe that the on-chip crypto accelerators and the HSA-enabled GPU represent accelerators with granularity independent latency, while the off-chip crypto accelerator and the discrete GPU/APU represent granularity dependent accelerators. For each accelerator, we calculate the speedup and performance metrics using the respective equations (§2).

5 EVALUATION

In this section, we show that LogCA closely captures the behavior of both off-chip and on-chip accelerators. We also list the calculated LogCA parameters in Table 5. To demonstrate the utility of our model, we also present two case studies. In these studies, we consider the evolution of the interface in Sun/Oracle's crypto accelerators and in three different GPU architectures. In both cases, we elaborate the design changes using the insights LogCA provides.

5.1 Linear-Complexity Kernels (β = 1)

Figure 7 shows the curve fitting of LogCA for AES. We consider both off-chip and on-chip accelerators, connected through different interfaces ranging from the PCIe bus to special instructions. We observe that the off-chip accelerators and the APU, unlike the on-chip accelerators, provide reasonable speedup only at very large granularities. We also observe that the achievable speedup is limited by the computational intensity for off-chip accelerators and by the acceleration for on-chip accelerators. This observation supports our earlier implication on the limits of speedup for granularity independent and dependent latencies in equations (3) and (9), respectively.

Figure 7 also shows that UltraSPARC T2 provides higher speedups than Sandy Bridge, but breaks even at a larger granularity. Sandy Bridge, on the other hand, breaks even at a very small granularity but provides limited speedup. The discrete GPU, with its powerful processing cores, has the highest acceleration among these designs. However, its observed speedup is lower than the others' due to the high overheads and latencies involved in communicating through the PCIe bus.

We have also marked g_1 and g_{A/2} for each accelerator in Figure 7, which help programmers and architects identify the complexity of the interface. For example, g_1 for the crypto instructions, i.e., SPARC T4 and Sandy Bridge, lies on the extreme left, while for the off-chip accelerators g_1 lies on the far right. It is worth mentioning that we have marked g_{A/2} for the on-chip accelerators but not for the off-chip accelerators. For the off-chip accelerators, the computational intensity is less than the acceleration, and as we have noted in equation (12), g_{A/2} for these designs does not exist.

We also observe that g_1 for the crypto card connected through the PCIe bus does not exist, showing that this accelerator does not break even, even for large granularities. Figure 7 also shows that g_1 for the GPU and the APU is comparable. This observation shows that, despite being an integrated GPU and not connected to the PCIe bus, the APU spends considerable time copying data from the host to device memory.
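The non-existence of g_{A/2} for the off-chip designs can be seen directly from the model. As a sketch, assuming the granularity dependent form of the accelerator execution time from Section 2, T_1(g) = o + L·g + C·g^β/A, with β = 1:

\[
  \lim_{g\to\infty} \mathrm{Speedup}(g)
    = \lim_{g\to\infty} \frac{C\,g}{o + L\,g + C\,g/A}
    = \frac{C}{L + C/A}
    = A \cdot \frac{C/L}{C/L + A},
\]

which reaches A/2 only if the computational intensity C/L is at least A. When C/L < A, the speedup saturates below A/2 and g_{A/2} does not exist, matching the behavior of the off-chip accelerators above.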


[Figure 7: Speedup curve-fitting plots comparing LogCA with the observed values of AES [30], with g_1 and g_{A/2} marked where they exist. Panels: (a) PCIe crypto, (b) NVIDIA discrete GPU, (c) AMD integrated GPU (APU), (d) UltraSPARC T2, (e) SPARC T3, (f) SPARC T4 engine, (g) SPARC T4 instruction, (h) AES-NI on Sandy Bridge. Axes: granularity (bytes) vs. speedup.]

[Figure 8: Speedup curve-fitting plots comparing LogCA with the observed values of SHA256 [31]; LogCA starts following the observed values after 64B. Panels: (a) UltraSPARC T2 engine, (b) SPARC T3 engine, (c) SPARC T4 engine, (d) SPARC T4 instruction. Axes: granularity (bytes) vs. speedup.]


Figure 8 shows the curve fitting for SHA on various on-chip crypto accelerators. We observe that g_1 and g_{A/2} do exist, as all of these are on-chip accelerators. We also observe that the LogCA curve mostly follows the observed values. However, it deviates from the observed values below 64B. This happens because SHA requires a block size of 64B for hash computation: if the block size is less than 64B, it pads extra bits to reach 64B. Since LogCA does not capture this effect, it does not follow the observed speedup for granularities smaller than 64B.

Figure 9-a shows the speedup curve fitting plots for radix sort. We observe that LogCA does not follow the observed values at smaller granularities on the GPU. Despite this inaccuracy, LogCA accurately predicts g_1 and g_{A/2}. We also observe that g_{A/2} for the GPU is higher than for the APU, and this observation supports equation (7): increasing acceleration increases g_{A/2}.
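This trend also follows from the model. As a sketch, again assuming T_1(g) = o + L·g + C·g/A for a linear kernel (β = 1) with granularity dependent latency, as for the GPU and APU, solving Speedup(g_{A/2}) = A/2 gives

\[
  \frac{C\,g}{o + L\,g + C\,g/A} = \frac{A}{2}
  \quad\Longrightarrow\quad
  g_{A/2} = \frac{A\,o}{C - A\,L},
\]

so g_{A/2} grows with A (and exists only while C/L > A, consistent with the earlier observation for the off-chip designs).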

5.2 Super-Linear Complexity Kernels (β > 1)

Figures 9-b and 9-c show the speedup curve fitting plots for super-linear kernels on the discrete GPU and the APU. We observe that matrix multiplication, with its higher complexity (O(g^1.7)), achieves higher speedup than sort and FFT, with their lower complexities of O(g) and O(g^1.2), respectively. This observation corroborates the result from equation (9) that the achievable speedup of higher-complexity algorithms is higher than that of lower-complexity algorithms. We also observe that g_{A/2} does not exist for FFT. This happens because, as we note in equation (12), for g_{A/2} to exist for FFT, C/L should be greater than A^{1/2}. However, Figure 9-c shows that C/L is smaller than A^{1/2} for both the GPU and the APU.

5.3 Sub-Linear Complexity Kernels (β < 1)

Figure 9-d shows the curve fitting for binary search, a sub-linear algorithm (β = 0.14). We make three observations. First, g_1 does not exist, even for very large granularities, and C/L < 1. This observation supports implication (5) that, for a sub-linear algorithm of β = 0.14, C/L should be greater than 7 to provide any speedup.


[Figure 9: Speedup curve-fitting plots comparing LogCA with the observed values of (a) radix sort, (b) matrix multiplication, (c) FFT, and (d) binary search, on the GPU and the APU, with g_1 and g_{A/2} marked where they exist. Axes: granularity (bytes) vs. speedup.]

Second, for large granularities, speedup starts decreasing with an increase in granularity. This observation supports our earlier claim in implication (4) that, for systems with granularity dependent latencies, the speedup for sub-linear algorithms asymptotically decreases. Third, LogCA deviates from the observed values at large granularities. This deviation occurs because LogCA does not model caches. As mentioned earlier, LogCA abstracts the caches and memories with a single latency parameter, which does not capture the memory-access pattern accurately. Even though LogCA does not accurately capture binary search's behavior, it still provides an upper bound on the achievable performance.

5.4 Case Studies

Figure 10 shows the evolution of crypto accelerators in SPARC architectures, from the off-chip accelerators in pre-Niagara (Figure 10 (a)) to the accelerators integrated within the pipeline in SPARC T4 (Figure 10 (e)). We observe that latency is absent in the on-chip accelerators' optimization regions, as these accelerators are closely coupled with the host. We also note that the optimization region with overheads, representing the complexity of an accelerator's interface, shrinks, while the optimization regions with acceleration expand from Figure 10 (a) to (e). For example, for the off-chip crypto accelerator, the cut-off granularity for overheads is 256KB, whereas it is 128B for SPARC T4, suggesting a much simpler interface.

Figure 10 (a) shows the optimization regions for the off-chip crypto accelerator connected through the PCIe bus. We note that overheads and latencies occupy most of the optimization regions, indicating high-overhead OS calls and high-latency data copying over the PCIe bus as the bottlenecks.

Figure 10 (b) shows the optimization regions for UltraSPARC T2. The large cut-off granularity for overheads at 32KB suggests a complex interface, indicating high-overhead OS calls creating a bottleneck at small granularities. The cut-off granularity of 2KB for acceleration suggests that optimizing acceleration is beneficial at large granularities.

Figure 10 (d) shows the optimization regions for the on-chip accelerator on SPARC T4. There are three optimization regions, with the cut-off granularity for overheads now reduced to only 512B. This observation suggests a considerable improvement in the interface design over SPARC T3, which is also evident from a smaller g_1. We also note that the cut-off granularity for acceleration now decreases to 32B, showing an increase in the opportunity for optimizing acceleration.

Figure 10 (e) shows the optimization regions for the crypto instructions on SPARC T4. We observe that, unlike the earlier designs, it has only two optimization regions, and the speedup approaches the peak acceleration at a small granularity of 128B. In contrast, UltraSPARC T2 and SPARC T3 do not provide any gains at this granularity. We also observe that the cut-off granularity for overheads further reduces to 128B, suggesting some opportunity for optimization at very small granularities. The model also shows that acceleration occupies the maximum range for optimization; for example, optimizing acceleration provides benefits for granularities greater than 16B. The low-overhead access which LogCA shows is due to the non-privileged instruction SPARC T4 uses to access the cryptographic unit, which is integrated within the pipeline.

Figure 11 shows the evolution of the memory interface design in GPU architectures. It shows the optimization regions for matrix multiplication on a discrete NVIDIA GPU, an AMD integrated GPU (APU), and an integrated AMD GPU with HSA support. We observe that matrix multiplication for all three architectures is compute bound (§3.1). We also observe that the computational index occupies most of the regions, which signifies maximum optimization potential.


[Figure 10: Optimization regions (LogCA with L 1/10x, o 1/10x, C 10x, A 10x) for performing the Advanced Encryption Standard on various crypto accelerators: (a) PCIe crypto accelerator, (b) UltraSPARC T2, (c) SPARC T3, (d) SPARC T4 engine, (e) SPARC T4 instruction. LogCA identifies the design bottlenecks through the LogCA parameters in an optimization region; the bottleneck which LogCA suggests in each design is optimized in the next design. Axes: granularity (bytes) vs. speedup.]

[Figure 11: Optimization regions (LogCA with L 1/10x, o 1/10x, C 10x, A 10x) for matrix multiplication over a range of granularities on (a) an NVIDIA discrete GPU, (b) an AMD APU, and (c) an HSA-supported AMD integrated GPU. Axes: granularity (bytes) vs. speedup.]


The discrete GPU has four optimization regions (Figure 11 (a)). Among these, latency dominates most of the regions, signifying high-latency data copying over the PCIe bus and thus maximum optimization potential. The high cut-off granularity for overheads at 32KB indicates high-overhead OS calls to access the GPU. Similarly, with highly aggressive cores, acceleration has a high cut-off granularity of 256KB, indicating less optimization potential for acceleration.

Similar to the discrete GPU, the APU also has four optimization regions (Figure 11 (b)). There are a few notable differences compared to the discrete GPU: the cut-off granularity for latency reduces to 512KB with the elimination of data copying over the PCIe bus; the overheads are still high, suggesting high-overhead OS calls to access the APU; and, with less aggressive cores, the cut-off granularity for acceleration reduces to 64KB, implying more optimization potential for acceleration.

Figure 11 (c) shows three optimization regions for the HSA-enabled integrated GPU. We observe that latency is absent in all regions and that the cut-off granularity for overheads reduces to 8KB. These reductions in overheads and latencies signify a simpler interface compared to the discrete GPU and APU. We also observe that the cut-off granularity for acceleration drops to 2KB, suggesting higher potential for optimizing acceleration.

6 RELATED WORK

We compare and contrast our work with prior approaches. Lopez-Novoa et al. [28] provide a detailed survey of various accelerator modeling techniques. We broadly classify these techniques into two categories and discuss the most relevant work.

Analytical Models. There is a rich body of work exploring analytical models for performance prediction of accelerators. For some models, the motivation is to determine the future trend in heterogeneous architectures. Chung et al. [7], in a detailed study, predict the future landscape of heterogeneous computing. Hempstead et al. [17] propose an early-stage model, Navigo, that determines the fraction of area required for accelerators to maintain the traditional performance trend. Nilakantan et al. [32] propose to incorporate communication cost in early-stage models of accelerator-rich architectures. For others, the motivation is to determine the right amount of data to offload and the potential benefits associated with an accelerator [24].

Some analytical models are architecture specific. For example, a number of studies [20, 21, 44, 57] predict the performance of GPU architectures. Hong et al. [20] present an analytical performance model for predicting execution time on GPUs. They later extend their model and develop an integrated power and performance model for GPUs [21]. Song et al. [44] use a simple counter-based approach to predict power and performance. Meswani et al. [29] explore such models for high-performance applications. Daga et al. [11] analyze the effectiveness of accelerated processing units (APUs) over GPUs and describe the communication cost over the PCIe bus as a major bottleneck in exploiting the full potential of GPUs.

In general, our work differs from these studies in its complexity. These models use a large number of parameters to accurately predict power and/or performance, whereas we limit the number of parameters to reduce the complexity of our model. They also require a deep understanding of the underlying architecture, and most of them require access to GPU-specific assembly or PTX code. Unlike these approaches, we use CPU code to provide bounds on the performance.

Roofline Models. In terms of simplicity and motivation, our work closely matches the Roofline model [54], a visual performance model for multi-core architectures. Roofline exposes bottlenecks for a kernel and suggests several optimizations which programmers can use to fine-tune the kernel on a given system.

A number of extensions of Roofline have been proposed [10, 22, 33, 56], and some of these extensions are architecture specific, for example, targeting GPUs [22], vector processors [39], and FPGAs [10, 56].

Despite the similarities, Roofline and its extensions cannot be used for exposing design bottlenecks in an accelerator's interface. The primary goal of Roofline models has been to help programmers and compiler writers, while LogCA provides more insights for architects.

7 CONCLUSION AND FUTURE WORK

With the recent trend towards heterogeneous computing, we feel that the architecture community lacks a model to reason about the need for accelerators. In this respect, we propose LogCA, an insightful visual performance model for hardware accelerators. LogCA provides insights, early in the design stage, to both architects and programmers: it identifies performance bounds, exposes interface design bottlenecks, and suggests optimizations to alleviate these bottlenecks. We have validated our model across a range of on-chip and off-chip accelerators and have shown its utility using retrospective studies describing the evolution of the accelerator interface in these architectures.

The applicability of LogCA can be limited by our simplifying assumptions, and for more realistic analysis we plan to overcome these limitations in future work. For example, we assume a single-accelerator system and do not explicitly model contention among resources; our model should handle multi-accelerator and pipelined scenarios. For fixed-function accelerators, our design space is currently limited to encryption and hashing kernels. To overcome this, we are expanding our design space with the compression and database accelerators in the Oracle M7 processor. We also plan to complement LogCA with an energy model, as energy efficiency is a prime design metric for accelerators.

ACKNOWLEDGEMENTS

We thank our anonymous reviewers, Arkaprava Basu, and Tony Nowatzki for their insightful comments and feedback on the paper. Thanks to Mark Hill, Michael Swift, the Wisconsin Computer Architecture Affiliates, and other members of the Multifacet group for their valuable discussions. We also thank Brian Wilson at University of Wisconsin DoIT and Eric Sedlar at Oracle Labs for providing access to SPARC T3 and T4 servers, respectively. Also thanks to Muhammad Umair Bin Altaf for his help in the formulation. This work is supported in part by the National Science Foundation (CNS-1302260, CCF-1438992, CCF-1533885, CCF-1617824), Google, and the University of Wisconsin-Madison (Amar and Balindar Sohi Professorship in Computer Science). Wood has a significant financial interest in AMD and Google.

REFERENCES

[1] Advanced Micro Devices. 2016. APP SDK - A Complete Development Platform. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk
[2] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS '67 (Spring)). 483. https://doi.org/10.1145/1465482.1465560
[3] Dan Anderson. 2012. How to tell if SPARC T4 crypto is being used. https://blogs.oracle.com/DanX/entry/how_to_tell_if_sparc
[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. 2009. A View of the Parallel Computing Landscape. Commun. ACM 52, 10 (Oct. 2009), 56-67. https://doi.org/10.1145/1562764.1562783
[5] Nathan Beckmann and Daniel Sanchez. 2016. Cache Calculus: Modeling Caches through Differential Equations. Computer Architecture Letters PP, 99 (2016). https://doi.org/10.1109/LCA.2015.2512873
[6] C. Cascaval, S. Chatterjee, H. Franke, K. J. Gildea, and P. Pattnaik. 2010. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development 54 (2010), 5:1-5:10. https://doi.org/10.1147/JRD.2010.2059721
[7] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. 2010. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture. 225-236. https://doi.org/10.1109/MICRO.2010.36
[8] Jason Cong, Zhenman Fang, Michael Gill, and Glenn Reinman. 2015. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. In 2015 IEEE/ACM International Conference on Computer-Aided Design. Austin, TX.
[9] D. Culler, R. Karp, D. Patterson, and A. Sahay. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 1-12. http://dl.acm.org/citation.cfm?id=155333
[10] Bruno da Silva, An Braeken, Erik H. D'Hollander, and Abdellah Touhafi. 2013. Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools. Int. J. Reconfig. Comput. 2013 (Jan. 2013). https://doi.org/10.1155/2013/428078
[11] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. 2011. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing. IEEE, 141-149. https://doi.org/10.1109/SAAHPC.2011.29
[12] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA '11). ACM, New York, NY, USA, 365. https://doi.org/10.1145/2000064.2000108
[13] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. 2010. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development 54 (2010), 3:1-3:11. https://doi.org/10.1147/JRD.2009.2036980
[14] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings of the International Symposium on High-Performance Computer Architecture. 503-514. https://doi.org/10.1109/HPCA.2011.5749755
[15] Shay Gueron. 2012. Intel Advanced Encryption Standard (AES) Instructions Set. Technical Report. Intel Corporation. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[16] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). 37. https://doi.org/10.1145/1815961.1815968
[17] Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. Navigo: An early-stage model to study power-constrained architectures and specialization. In ISCA Workshop on Modeling, Benchmarking, and Simulations (MoBS).
[18] John L. Hennessy and David A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach. 704 pages.
[19] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (July 2008), 33-38. https://doi.org/10.1109/MC.2008.209
[20] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, Vol. 37. 152-163. https://doi.org/10.1145/1555815.1555775
[21] Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, Vol. 38. 280-289. https://doi.org/10.1145/1816038.1815998
[22] Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par '12). Springer-Verlag, Berlin, Heidelberg, 920-932. https://doi.org/10.1007/978-3-642-32820-6_90
[23] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 468-479. https://doi.org/10.1145/2540708.2540748
[24] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. 2013. A survey of computation offloading for mobile systems. Mobile Networks and Applications 18 (2013), 129-140. https://doi.org/10.1007/s11036-012-0368-0
[25] Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 361-372. https://doi.org/10.1145/2751205.2751231
[26] Maysam Lavasani, Hari Angepat, and Derek Chiou. 2014. An FPGA-based In-Line Accelerator for Memcached. IEEE Computer Architecture Letters 13, 2 (July 2014), 57-60. https://doi.org/10.1109/L-CA.2013.17
[27] John D. C. Little and Stephen C. Graves. 2008. Little's law. In Building Intuition. Springer, 81-100.
[28] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso. 2015. A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. IEEE Transactions on Parallel and Distributed Systems 26, 1 (Jan. 2015), 272-281. https://doi.org/10.1109/TPDS.2014.2308216
[29] M. R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole. 2013. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications 27 (2013), 89-108. https://doi.org/10.1177/1094342012468180
[30] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). https://doi.org/10.6028/NIST.FIPS.197
[31] National Institute of Standards and Technology. 2008. Secure Hash Standard. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
[32] S. Nilakantan, S. Battle, and M. Hempstead. 2013. Metrics for Early-Stage Modeling of Many-Accelerator Architectures. Computer Architecture Letters 12, 1 (Jan. 2013), 25-28. https://doi.org/10.1109/L-CA.2012.9
[33] Cedric Nugteren and Henk Corporaal. 2012. The Boat Hull Model: Enabling Performance Prediction for Parallel Computing Prior to Code Development. In Proceedings of the 9th Conference on Computing Frontiers. ACM, 203-212.
[34] OpenSSL Software Foundation. 2015. OpenSSL: Cryptography and SSL/TLS Toolkit. https://openssl.org
[35] Sanjay Patel. 2009. Sun's Next-Generation Multithreaded Processor: Rainbow Falls. In 21st Hot Chips Symposium. http://www.hotchips.org/wp-content/uploads/hc
[36] Sanjay Patel and Wen-mei W. Hwu. 2008. Accelerator Architectures. IEEE Micro 28, 4 (July 2008), 4-12. https://doi.org/10.1109/MM.2008.50
[37] Stephen Phillips. 2014. M7: Next Generation SPARC. In 26th Hot Chips Symposium.
[38] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.
[39] Yoshiei Sato, Ryuichi Nagaoka, Akihiro Musa, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Hiroaki Kobayashi. 2009. Performance tuning and analysis of future vector processors based on the roofline model. In Proceedings of the 10th MEDEA Workshop on MEmory performance: DEaling with Applications, systems and architecture (MEDEA '09). 7. https://doi.org/10.1145/1621960.1621962
[40] M. Shah, J. Barreh, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. 2007. UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC. In IEEE Asian Solid-State Circuits Conference (A-SSCC '07). 22-25. https://doi.org/10.1109/ASSCC.2007.4425786
[41] Manish Shah, Robert Golla, Gregory Grohoski, Paul Jordan, Jama Barreh, Jeffrey Brooks, Mark Greenberg, Gideon Levinsky, Mark Luttrell, Christopher Olson, Zeid Samoail, Matt Smittle, and Thomas Ziaja. 2012. Sparc T4: A dynamically threaded server-on-a-chip. IEEE Micro 32 (2012), 8-19. https://doi.org/10.1109/MM.2012.1
[42] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In International Symposium on Computer Architecture (ISCA).
[43] Soekris Engineering. 2016. vpn1401 for Std. PCI-sockets. http://soekris.com/products/vpn-1401.html
[44] Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In Proceedings of the IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS 2013). 673-686. https://doi.org/10.1109/IPDPS.2013.73
[45] Jeff Stuecheli. 2013. POWER8. In 25th Hot Chips Symposium.
[46] Ning Sun and Chi-Chang Lin. 2007. Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors. Technical Report. http://www.oracle.com/technetwork/server-storage/solaris/documentation/819-5782-150147.pdf
[47] S. Tabik, G. Ortega, and E. M. Garzón. 2014. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. The Journal of Supercomputing 70, 2 (Nov. 2014), 577-587. https://doi.org/10.1007/s11227-014-1102-4
[48] Y. C. Tay. 2013. Analytical Performance Modeling for Computer Systems (2nd ed.). Morgan & Claypool Publishers.
[49] M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE. 1131-1136. https://doi.org/10.1145/2228360.2228567
[50] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 205-218. https://doi.org/10.1145/1736020.1736044
[51] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44). 163. https://doi.org/10.1145/2155620.2155640
[52] Guibin Wang, Yisong Lin, and Wei Yi. 2010. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on Cyber, Physical and Social Computing (CPSCom). 344-350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102
[53] Eric W. Weisstein. 2015. Newton's Method. From MathWorld, A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html
[54] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52 (2009), 65-76. https://doi.org/10.1145/1498765.1498785
[55] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 255-268. https://doi.org/10.1145/2541940.2541961
[56] Moein Pahlavan Yali. 2014. FPGA-Roofline: An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems. Ph.D. Dissertation. Virginia Polytechnic Institute and State University.
[57] Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings of the International Symposium on High-Performance Computer Architecture. 382-393. https://doi.org/10.1109/HPCA.2011.5749745


            run micro-benchmarks and use execution time on host and acceler-ators On the other hand for the GPUs we compute latency usingpeak memory bandwidth of the GPU Similar to Meswani et al [29]we use the following equation for measuring data copying time forthe GPUs L = 1

            BWpeak

            Earlier we develop our model using assumptions of granularityindependent and dependent latencies In our setup we observe thatthe on-chip crypto accelerators and HSA-enabled GPU representaccelerators with granularity independent latency while the off-chipcrypto accelerator and discrete GPUAPU represent the granular-ity dependent accelerators For each accelerator we calculate thespeedup and performance metrics using the respective equations(sect2)

            5 EVALUATIONIn this section we show that LogCA closely captures the behavior forboth off and on-chip accelerators We also list the calculate LogCAparameters in Table 5 To demonstrate the utility of our modelwe also present two case studies In these studies we consider theevolution of interface in SUNOraclersquos crypto accelerators and threedifferent GPU architectures In both cases we elaborate the designchanges using the insights LogCA provides

            51 Linear-Complexity Kernels (β = 1)Figure 7 shows the curve-fitting of LogCA for AES We considerboth off-chip and on-chip accelerators connected through differentinterfaces ranging from PCIe bus to special instructions We observethat the off-chip accelerators and APU unlike on-chip acceleratorsprovide reasonable speedup only at very large granularities We alsoobserve that the achievable speedup is limited by computationalintensity for off-chip accelerators and acceleration for on-chip accel-erators This observation supports earlier implication on the limitsof speedup for granularity independent and dependent latencies inequation (3) and (9) respectively

            Figure 7 also shows that UltraSPARC T2 provides higher speedupsthan Sandy Bridge but it breaks-even at a larger granularity SandyBridge on the other hand breaks-even at very small granularitybut provides limited speedup The discrete GPU with powerful pro-cessing cores has the highest acceleration among others Howeverits observed speedup is less than others due to high overheads andlatencies involved in communicating through the PCIe bus

            We have also marked g1 and g A2for each accelerator in Figure 7

            which help programmers and architects identify the complexity ofthe interface For example g1 for crypto instructions ie SPARCT4 and Sandy Bridge lies on the extreme left while for the off-chipaccelerators g1 lies on the far right It is worth mentioning that wehave marked g a

            2for on-chip accelerators but not for the off-chip

            accelerators For off-chip accelerators computational intensity isless than acceleration and as we have noted in equation (12) thatg A

            2for these designs does not existWe also observe that g1 for the crypto-card connected through

            the PCIe bus does not exist showing that this accelerator does notbreak-even even for large granularities Figure 7 also shows thatg1 for GPU and APU is comparable This observation shows thatdespite being an integrated GPU and not connected to the PCIe bus

            LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            A

            CL

            Sp

            eed

            up

            (a) PCIe crypto

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100A

            g1

            CL

            (b) NVIDIA Discrete GPU

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            A

            g1 gA2

            CL

            (c) AMD Integrated GPU (APU)

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            A

            g1 gA2

            CL

            (d) UltraSPARC T2

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            A

            g1 gA2

            CL

            Granularity (Bytes)

            Sp

            eed

            up

            (e) SPARC T3

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            A

            g1 gA2

            CL

            Granularity (Bytes)

            (f) SPARC T4 engine

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            A

            gA2

            CL

            g1 lt 16B

            Granularity (Bytes)

            (g) SPARC T4 instruction

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            A

            CL

            g1 gA2lt 16B

            Granularity (Bytes)

            (h) AESNI on Sandy Bridge

            observed LogCA

            Figure 7 Speedup curve fittings plots comparing LogCA with the observed values of AES [30]

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            A

            g1 gA2

            CL

            Granularity (Bytes)

            Sp

            eed

            up

            (a) UltraSPARC T2 engine

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            A

            g1 gA2

            CL

            Granularity (Bytes)

            (b) SPARC T3 engine

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            A

            g1 gA2

            CL

            Granularity (Bytes)

            (c) SPARC T4 engine

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            A

            gA2

            g1 lt 16B

            CL

            Granularity (Bytes)

            (d) SPARC T4 instruction

            observed LogCA

            Figure 8 Speedup curve fittings plots comparing LogCA with the observed values of SHA256 [31] LogCA starts following observedvalues after 64B

            APU spends considerable time in copying data from the host todevice memory

            Figure 8 shows the curve fitting for SHA on various on-chipcrypto accelerators We observe that g1 and g A

            2do exist as all of

            these are on-chip accelerators We also observe that the LogCAcurve mostly follows the observed value However it deviates fromthe observed value before 64B This happens because SHA requiresblock size of 64B for hash computation If the block size is less than64B it pads extra bits to make the block size 64B Since LogCAdoes not capture this effect it does not follow the observed speedupfor granularity smaller than 64B

            Figure 9-a shows the speedup curve fitting plots for Radix sortWe observe that LogCA does not follow observed values for smallergranularities on GPU Despite this inaccuracy LogCA accuratelypredicts g1 and g A

            2 We also observe that g A

            2for GPU is higher than

            APU and this observation supports equation (7) that increasingacceleration increases g A

            2

            52 Super-Linear Complexity Kernels (β gt 1)Figures 9-b and 9-c show the speedup curve fitting plots for super-complexity kernels on discrete GPU and APU We observe that ma-trix multiplication with higher complexity (O (g17)) achieves higherspeedup than sort and FFT with lower complexities of O (g) andO (g12) respectively This observation corroborates results fromequation (9) that achievable speedup of higher-complexity algo-rithms is higher than lower-complexity algorithms We also observethat g A

            2does not exist for FFT This happens because as we note in

            equation (12) that for g A2to exist for FFT C

            L should be greater thanA

            12 However Figure 9-c shows that CL is smaller than A

            12 for bothGPU and APU

            53 Sub-Linear Complexity Kernels (β lt 1)Figure 9-d shows the curve fitting for binary search which is asub-linear algorithm (β = 014) We make three observations Firstg1 does not exist even for very large granularities and C

            L lt 1 Thisobservation supports implication (5) that for a sub-linear algorithm

            ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            A

            g1 gA2

            CL

            Sp

            eed

            up

            GPU

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            A

            g1 gA2

            CL

            APU

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            A

            g1 gA2

            GPU

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            A

            g1 gA2

            CL

            APU

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100A

            g1

            CL

            Granularity (Bytes)

            Sp

            eed

            up

            GPU

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            A

            g1

            CL

            Granularity (Bytes)

            APU

            16 128 1K 8K 64

            K51

            2K 4M 32M

            001

            01

            1

            10

            100A

            g1 gA2

            CL

            Granularity (Bytes)

            GPU

            16 128 1K 8K 64

            K51

            2K 4M 32M

            001

            01

            1

            10

            100

            A

            g1 gA2

            CL

            Granularity (Bytes)

            APU

            (a) Radix Sort (b) Matrix Multiplication

            (c) FFT (d) Binary Search

            observed LogCA

            Figure 9 Speedup curve fittings plots comparing LogCA with the observed values of (a) Radix Sort (b) Matrix Multiplication (c) FFTand (d) Binary Search

            of β = 014 CL should be greater than 7 to provide any speedup

            Second for large granularities speedup starts decreasing with anincrease in granularity This observation supports our earlier claimin implication (4) that for systems with granularity dependent la-tencies speedup for sub-linear algorithms asymptotically decreasesThird LogCA deviates from the observed value at large granularitiesThis deviation occurs because LogCA does not model caches Asmentioned earlier LogCA abstracts the caches and memories witha single parameter of latency which does not capture the memory-access pattern accurately Even though LogCA does not accuratelycaptures binary search behavior it still provides an upper bound onthe achievable performance

            54 Case StudiesFigure 10 shows the evolution of crypto accelerators in SPARCarchitectures from the off-chip accelerators in pre-Niagara (Figure 10(a)) to accelerators integrated within the pipeline in SPARC T4(Figure 10 (e)) We observe that latency is absent in the on-chipacceleratorsrsquo optimization regions as these accelerators are closelycoupled with the host We also note that the optimization regionwith overheadsmdashrepresenting the complexity of an acceleratorrsquosinterfacemdashshrinks while the optimization regions with accelerationexpand from Figure 10 (a-e) For example for the off-chip cryptoaccelerator the cut-off granularity for overheads is 256KB whereasit is 128B for the SPARC T4 suggesting a much simpler interface

            Figure 10 (a) shows the optimization regions for the off-chipcrypto accelerator connected through the PCIe bus We note thatoverheads and latencies occupy most of the optimization regionsindicating high overhead OS calls and high-latency data copyingover the PCIe bus as the bottlenecks

            Figure 10 (b) shows the optimization regions for UltraSPARCT2 The large cut-off granularity for overheads at 32KB suggestsa complex interface indicating high overhead OS call creating abottleneck at small granularities The cut-off granularity of 2KB foracceleration suggests that optimizing acceleration is beneficial atlarge granularities

            Figure 10 (d) shows optimization regions for on-chip acceleratoron SPARC T4 There are three optimization regions with the cut-offgranularity for overhead now reduced to only 512B This observationsuggests a considerable improvement in the interface design overSPARC T3 and it is also evident by a smaller g1 We also note thatcut-off granularity for acceleration now decreases to 32B showingan increase in the opportunity for optimizing acceleration

            Figure 10 (e) shows optimization regions for crypto instructionson SPARC T4 We observe that unlike earlier designs it has only twooptimization regions and the speedup approaches the peak accelera-tion at a small granularity of 128B In contrast UltraSPARC T2 andSPARC T3 do not even provide any gains at this granularity We alsoobserve that the cut-off granularity for overheads further reduces to128B suggesting some opportunity for optimization at very smallgranularities The model also shows that the acceleration occupiesthe maximum range for optimization For example optimizing accel-eration provides benefits for granularities greater than 16B The lowoverhead access which LogCA shows is due to the non-privilegedinstruction SPARC T4 uses to access the cryptographic unit whichis integrated within the pipeline

            Figure 11 shows the evolution of memory interface design inGPU architectures It shows the optimization regions for matrixmultiplication on a discrete NVIDIA GPU an AMD integrated GPU(APU) and an integrated AMD GPU with HSA support We observethat matrix multiplication for all three architectures is compute bound

            LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            1000

            A

            CL

            oCLoC

            LC

            oL

            Granularity (Bytes)

            Sp

            eed

            up

            (a) PCIe Crypto Accelerator

            16 128 g1 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            1000

            A

            CL

            oC AoCA

            oA

            Granularity (Bytes)

            (b) UltraSPARC T2

            16 128 g1 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            1000CL

            A

            oC AoCA

            oA

            Granularity (Bytes)

            (c) SPARC T3

            g112

            8 1K 8K 64K

            512K 4M 32

            M

            01

            1

            10

            100

            1000

            A

            oCA A

            CL

            oA

            Granularity (Bytes)

            Sp

            eed

            up

            (d) SPARC T4 engine

            16 128 1K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            1000

            A

            oCA A

            CL

            oA

            Granularity (Bytes)

            (e) SPARC T4 instruction

            LogCA L110xo110x C10x A10x

            Figure 10 LogCA for performing Advanced Encryption Standard on various crypto accelerators LogCA identifies the design bottle-necks through LogCA parameters in an optimization region The bottlenecks which LogCA suggests in each design is optimized inthe next design

            16 128 1K 8K g1

            64K

            512K 4M 32

            M

            01

            1

            10

            100

            1000

            A

            LoC LCALC A

            ALC

            o

            Granularity (Bytes)

            Sp

            eed

            up

            (a) NVIDIA Discrete GPU

            16 128 1K 8K g1

            64K

            512K 4M 32

            M

            01

            1

            10

            100

            1000

            A

            LoCLo

            CA

            ACA

            AoL

            C

            Granularity (Bytes)

            (b) AMD Integrated GPU (APU)

            16 128 g11K 8K 64

            K51

            2K 4M 32M

            01

            1

            10

            100

            1000

            A

            oC o

            CA

            CA

            o

            AC

            Granularity (Bytes)

            (c) HSA supported AMD Integrated GPU

            LogCA L110xo110x C10x A10x

            Figure 11 Various Optimization regions for matrix multiplication over a range of granularities on (a) NVIDIA discrete GPU (b)AMD APU and (c) HSA Supported GPU

            (sect31) We also observe that the computational index occupies mostof the regions which signifies maximum optimization potential

            The discrete GPU has four optimization regions (Figure 11 (a))Among these latency dominates most of the regions signifyinghigh-latency data copying over the PCIe bus and thus maximumoptimization potential The high cut-off granularity for overheads at

            32KB indicates high overhead OS calls to access the GPU Similarlywith highly aggressive cores acceleration has high cut-off granular-ity of 256KB indicating less optimization potential for acceleration

            Similar to the discrete GPU the APU also has four optimiza-tion regions (Figure 11 (b)) There are few notable differences ascompared to the discrete GPU The cut-off granularity for latency

            ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

            reduces to 512KB with the elimination of data copying over thePCIe bus the overheads are still high suggesting high overhead OScalls to access the APU with less aggressive cores the cut-off granu-larity for acceleration reduces to 64KB implying more optimizationpotential for acceleration

            Figure 11 (c) shows three optimization regions for the HSA en-abled integrated GPU We observe that latency is absent in all regionsand the cut-off granularity for overhead reduces to 8KB These re-ductions in overheads and latencies signify a simpler interface ascompared to the discrete GPU and APU We also observe that thecut-off granularity for acceleration drops to 2KB suggesting higherpotential for optimizing acceleration

            6 RELATED WORKWe compare and contrast our work with prior approaches Lopez-Novoa et al [28] provide a detailed survey of various acceleratormodeling techniques We broadly classify these techniques in twocategories and discuss the most relevant work

            Analytical Models There is a rich body of work exploring ana-lytical models for performance prediction of accelerators For somemodels the motivation is to determine the future trend in heteroge-neous architectures Chung et al [7] in a detailed study predict thefuture landscape of heterogeneous computing Hempstead et al [17]propose an early-stage model Navigo that determines the fraction ofarea required for accelerators to maintain the traditional performancetrend Nilakantan et al [32] propose to incorporate communicationcost for early-stage model of accelerator-rich architectures For oth-ers the motivation is to determine the right amount of data to offloadand the potential benefits associated with an accelerator [24]

            Some analytical models are architecture specific For examplea number of studies [20 21 44 57] predict performance of GPUarchitectures Hong et al [20] present an analytical performancemodel for predicting execution time on GPUs They later extendtheir model and develop an integrated power and performance modelfor the GPUs [21] Song et al [44] use a simple counter basedapproach to predict power and performance Meswani et al [29]explore such models for high performance applications Daga etal [11] analyze the effectiveness of Accelerated processing units(APU) over GPUs and describe the communication cost over thePCIe bus as a major bottleneck in exploiting the full potential ofGPUs

            In general our work is different from these studies because ofthe complexity These models use a large number of parameters toaccurately predict the power andor performance whereas we limitthe number of parameters to reduce the complexity of our modelThey also require deep understanding of the underlying architectureMost of these models also require access to GPU specific assemblyor PTX codes Unlike these approaches we use CPU code to providebounds on the performance

            Roofline Models In terms of simplicity and motivation our workclosely matches the Roofline model [54]mdasha visual performancemodel for multi-core architectures Roofline exposes bottlenecks fora kernel and suggests several optimizations which programmers canuse to fine tune the kernel on a given system

            A number of extensions of Roofline have been proposed [1022 33 56] and some of these extensions are architecture specific

            For example targeting GPUs [22] vector processors [39] and FP-GAs [10 56]

            Despite the similarities roofline and its extensions cannot be usedfor exposing design bottlenecks in an acceleratorrsquos interface Theprimary goal of roofline models has been to help programmers andcompiler writer while LogCA provides more insights for architects

            7 CONCLUSION AND FUTURE WORKWith the recent trend towards heterogeneous computing we feelthat the architecture community lacks a model to reason about theneed of accelerators In this respect we propose LogCAmdashan insight-ful visual performance model for hardware accelerators LogCAprovides insights early in the design stage to both architects andprogrammers and identifies performance bounds exposes interfacedesign bottlenecks and suggest optimizations to alleviate these bot-tlenecks We have validated our model across a range of on-chip andoff-chip accelerators and have shown its utility using retrospectivestudies describing the evolution of acceleratorrsquos interface in thesearchitectures

            The applicability of LogCA can be limited by our simplifying as-sumptions and for more realistic analysis we plan to overcome theselimitations in our future work For example We also assume a singleaccelerator system and do not explicitly model contention amongresources Our model should handle multi-accelerator and pipelinedscenarios For fixed function accelerators our design space is cur-rently limited to encryption and hashing kernels To overcome thiswe are expanding our design space with compression and databaseaccelerators in Oracle M7 processor We also plan to complementLogCA with an energy model as energy efficiency is a prime designmetric for accelerators

            ACKNOWLEDGEMENTSWe thank our anonymous reviewers Arkaprava Basu and TonyNowatzki for their insightful comments and feedback on the paperThanks to Mark Hill Michael Swift Wisconsin Computer Archi-tecture Affiliates and other members of the Multifacet group fortheir valuable discussions We also thank Brian Wilson at Universityof Wisconsin DoIT and Eric Sedlar at Oracle Labs for providingaccess to SPARC T3 and T4 servers respectively Also thanks toMuhammad Umair Bin Altaf for his help in the formulation Thiswork is supported in part by the National Science Foundation (CNS-1302260 CCF-1438992 CCF-1533885 CCF- 1617824) Googleand the University of Wisconsin-Madison (Amar and Balindar SohiProfessorship in Computer Science) Wood has a significant financialinterest in AMD and Google

            REFERENCES[1] Advanced Micro Devices 2016 APP SDK - A Complete Development Platform

            Advanced Micro Devices httpdeveloperamdcomtools-and-sdksopencl-zoneamd-accelerated-parallel-processing-app-sdk

            [2] Gene M Amdahl 1967 Validity of the single processor approach to achievinglarge scale computing capabilities Proceedings of the April 18-20 1967 springjoint computer conference on - AFIPS rsquo67 (Spring) (1967) 483 httpsdoiorg10114514654821465560

            [3] Dan Anderson 2012 How to tell if SPARC T4 crypto is being used httpsblogsoraclecomDanXentryhow_to_tell_if_sparc

            [4] Krste Asanovic Rastislav Bodik James Demmel Tony Keaveny Kurt KeutzerJohn Kubiatowicz Nelson Morgan David Patterson Koushik Sen John

            LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

            Wawrzynek David Wessel and Katherine Yelick 2009 A View of the Par-allel Computing Landscape Commun ACM 52 10 (oct 2009) 56ndash67 httpsdoiorg10114515627641562783

            [5] Nathan Beckmann and Daniel Sanchez 2016 Cache Calculus Modeling Cachesthrough Differential Equations Computer Architecture Letters PP 99 (2016)1 httpieeexploreieeeorglpdocsepic03wrapperhtmarnumber=7366753$delimiter026E30F$npapers3publicationdoi101109LCA20152512873

            [6] C Cascaval S Chatterjee H Franke K J Gildea and P Pattnaik 2010 Ataxonomy of accelerator architectures and their programming models IBMJournal of Research and Development 54 (2010) 51ndash510 httpsdoiorg101147JRD20102059721

            [7] Eric S Chung Peter a Milder James C Hoe and Ken Mai 2010 Single-ChipHeterogeneous Computing Does the Future Include Custom Logic FPGAs andGPGPUs 2010 43rd Annual IEEEACM International Symposium on Microar-chitecture (dec 2010) 225ndash236 httpsdoiorg101109MICRO201036

            [8] Jason Cong Zhenman Fang Michael Gill and Glenn Reinman 2015 PARADEA Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Archi-tectural Design and Exploration In 2015 IEEEACM International Conference onComputer-Aided Design Austin TX

            [9] D Culler R Karp D Patterson and A Sahay 1993 LogP Towards a realisticmodel of parallel computation In Proceedings of the Fourth ACM SIGPLANSymposium on Principles and Practice of Parallel Programming 1ndash12 httpdlacmorgcitationcfmid=155333

            [10] Bruno da Silva An Braeken Erik H DrsquoHollander and Abdellah Touhafi 2013Performance Modeling for FPGAs Extending the Roofline Model with High-level Synthesis Tools Int J Reconfig Comput 2013 (jan 2013) 77mdash-77httpsdoiorg1011552013428078

            [11] Mayank Daga Ashwin M Aji and Wu-chun Feng 2011 On the Efficacy of aFused CPU+GPU Processor (or APU) for Parallel Computing In 2011 Symposiumon Application Accelerators in High-Performance Computing Ieee 141ndash149httpsdoiorg101109SAAHPC201129

            [12] Hadi Esmaeilzadeh Emily Blem Renee St Amant Karthikeyan Sankaralingamand Doug Burger 2011 Dark silicon and the end of multicore scaling InProceeding of the 38th annual international symposium on Computer archi-tecture - ISCA rsquo11 ACM Press New York New York USA 365 httpsdoiorg10114520000642000108

            [13] H Franke J Xenidis C Basso B M Bass S S Woodward J D Brown andC L Johnson 2010 Introduction to the wire-speed processor and architectureIBM Journal of Research and Development 54 (2010) 31ndash311 httpsdoiorg101147JRD20092036980

            [14] Venkatraman Govindaraju Chen Han Ho and Karthikeyan Sankaralingam 2011Dynamically specialized datapaths for energy efficient computing In Proceedings- International Symposium on High-Performance Computer Architecture 503ndash514httpsdoiorg101109HPCA20115749755

            [15] Shay Gueron 2012 Intel Advanced Encryption Standard (AES) Instructions SetTechnical Report Intel Corporation httpssoftwareintelcomsitesdefaultfilesarticle165683aes-wp-2012-09-22-v01pdf

            [16] Rehan Hameed Wajahat Qadeer Megan Wachs Omid Azizi Alex SolomatnikovBenjamin C Lee Stephen Richardson Christos Kozyrakis and Mark Horowitz2010 Understanding sources of inefficiency in general-purpose chips Proceed-ings of the 37th annual international symposium on Computer architecture - ISCA

            rsquo10 (2010) 37 httpsdoiorg10114518159611815968[17] Mark Hempstead Gu-Yeon Wei and David Brooks 2009 Navigo An early-

            stage model to study power-constrained architectures and specialization In ISCAWorkshop on Modeling Benchmarking and Simulations (MoBS)

            [18] John L Hennessy and David A Patterson 2006 Computer Architecture FourthEdition A Quantitative Approach 704 pages httpsdoiorg10111151881

            [19] Mark D Hill and Michael R Marty 2008 Amdahlrsquos Law in the Multicore EraComputer 41 7 (jul 2008) 33ndash38 httpsdoiorg101109MC2008209

            [20] Sunpyo Hong and Hyesoon Kim 2009 An analytical model for a GPU architec-ture with memory-level and thread-level parallelism awareness In Proceedingsof the 36th Annual International Symposium on Computer Architecture Vol 37152ndash163 httpsdoiorg10114515558151555775

            [21] Sunpyo Hong and Hyesoon Kim 2010 An integrated GPU power and perfor-mance model In Proceedings of the 37th Annual International Symposium onComputer Architecture Vol 38 280mdash-289 httpsdoiorg10114518160381815998

            [22] Haipeng Jia Yunquan Zhang Guoping Long Jianliang Xu Shengen Yan andYan Li 2012 GPURoofline A Model for Guiding Performance Optimizations onGPUs In Proceedings of the 18th International Conference on Parallel Processing(Euro-Parrsquo12) Springer-Verlag Berlin Heidelberg 920ndash932 httpsdoiorg101007978-3-642-32820-6_90

            [23] Onur Kocberber Boris Grot Javier Picorel Babak Falsafi Kevin Lim andParthasarathy Ranganathan 2013 Meet the Walkers Accelerating Index Traver-sals for In-memory Databases In Proceedings of the 46th Annual IEEEACMInternational Symposium on Microarchitecture (MICRO-46) ACM New YorkNY USA 468ndash479 httpsdoiorg10114525407082540748

            [24] Karthik Kumar Jibang Liu Yung Hsiang Lu and Bharat Bhargava 2013 Asurvey of computation offloading for mobile systems Mobile Networks andApplications 18 (2013) 129ndash140 httpsdoiorg101007s11036-012-0368-0

            [25] Snehasish Kumar Naveen Vedula Arrvindh Shriraman and Vijayalakshmi Srini-vasan 2015 DASX Hardware Accelerator for Software Data Structures InProceedings of the 29th ACM on International Conference on Supercomputing(ICS rsquo15) ACM New York NY USA 361ndash372 httpsdoiorg10114527512052751231

            [26] Maysam Lavasani Hari Angepat and Derek Chiou 2014 An FPGA-based In-Line Accelerator for Memcached IEEE Comput Archit Lett 13 2 (jul 2014)57ndash60 httpsdoiorg101109L-CA201317

            [27] John D C Little and Stephen C Graves 2008 Littlersquos law In Building intuitionSpringer 81ndash100

            [28] U Lopez-Novoa A Mendiburu and J Miguel-Alonso 2015 A Survey of Perfor-mance Modeling and Simulation Techniques for Accelerator-Based ComputingParallel and Distributed Systems IEEE Transactions on 26 1 (jan 2015) 272ndash281httpsdoiorg101109TPDS20142308216

            [29] M R Meswani L Carrington D Unat A Snavely S Baden and S Poole2013 Modeling and predicting performance of high performance computingapplications on hardware accelerators International Journal of High Perfor-mance Computing Applications 27 (2013) 89ndash108 httpsdoiorg1011771094342012468180

            [30] National Institute of Standards and Technology 2001 Advanced EncryptionStandard (AES) National Institute of Standards and Technology httpsdoiorg106028NISTFIPS197

            [31] National Institute of Standards and Technology 2008 Secure Hash StandardNational Institute of Standards and Technology httpcsrcnistgovpublicationsfipsfips180-3fips180-3_finalpdf

            [32] S Nilakantan S Battle and M Hempstead 2013 Metrics for Early-Stage Model-ing of Many-Accelerator Architectures Computer Architecture Letters 12 1 (jan2013) 25ndash28 httpsdoiorg101109L-CA20129

            [33] Cedric Nugteren and Henk Corporaal 2012 The Boat Hull Model EnablingPerformance Prediction for Parallel Computing Prior to Code Development Cate-gories and Subject Descriptors In Proceedings of the 9th Conference on Comput-ing Frontiers ACM 203mdash-212

            [34] OpenSSL Software Foundation 2015 OpenSSL Cryptography and SSLTLSToolkit OpenSSL Software Foundation httpsopensslorg

            [35] Sanjay Patel 2009 Sunrsquos Next-Generation Multithreaded Processor RainbowFalls In 21st Hot Chip Symposium httpwwwhotchipsorgwp-contentuploadshc

            [36] Sanjay Patel and Wen-mei W Hwu 2008 Accelerator Architectures IEEE Micro28 4 (jul 2008) 4ndash12 httpsdoiorg101109MM200850

            [37] Stephen Phillips 2014 M7 Next Generation SPARC In 26th Hot Chip Sympo-sium

            [38] Phil Rogers 2013 Heterogeneous system architecture overview In Hot Chips[39] Yoshiei Sato Ryuichi Nagaoka Akihiro Musa Ryusuke Egawa Hiroyuki Tak-

            izawa Koki Okabe and Hiroaki Kobayashi 2009 Performance tuning andanalysis of future vector processors based on the roofline model Proceedingsof the 10th MEDEA workshop on MEmory performance DEaling with Applica-tions systems and architecture - MEDEA rsquo09 (2009) 7 httpsdoiorg10114516219601621962

            [40] M Shah J Barren J Brooks R Golla G Grohoski N Gura R Hetherington PJordan M Luttrell C Olson B Sana D Sheahan L Spracklen and A Wynn2007 UltraSPARC T2 A highly-treaded power-efficient SPARC SOC InSolid-State Circuits Conference 2007 ASSCC rsquo07 IEEE Asian 22ndash25 httpsdoiorg101109ASSCC20074425786

            [41] Manish Shah Robert Golla Gregory Grohoski Paul Jordan Jama Barreh JeffreyBrooks Mark Greenberg Gideon Levinsky Mark Luttrell Christopher OlsonZeid Samoail Matt Smittle and Thomas Ziaja 2012 Sparc T4 A dynamicallythreaded server-on-a-chip IEEE Micro 32 (2012) 8ndash19 httpsdoiorg101109MM20121

            [42] Yakun Sophia Shao Brandon Reagen Gu-Yeon Wei and David Brooks 2014Aladdin A Pre-RTL Power-Performance Accelerator Simulator Enabling LargeDesign Space Exploration of Customized Architectures In International Sympo-sium on Computer Architecture (ISCA)

            [43] Soekris Engineering 2016 vpn 1401 for Std PCI-sockets Soekris Engineeringhttpsoekriscomproductsvpn-1401html

            [44] Shuaiwen Song Chunyi Su Barry Rountree and Kirk W Cameron 2013 Asimplified and accurate model of power-performance efficiency on emergent GPUarchitectures In Proceedings - IEEE 27th International Parallel and DistributedProcessing Symposium IPDPS 2013 673ndash686 httpsdoiorg101109IPDPS201373

            [45] Jeff Stuecheli 2013 POWER8 In 25th Hot Chip Symposium[46] Ning Sun and Chi-Chang Lin 2007 Using the Cryptographic Accelerators in the

            UltraSPARC T1 and T2 processors Technical Report httpwwworaclecomtechnetworkserver-storagesolarisdocumentation819-5782-150147pdf

            ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

            [47] S Tabik G Ortega and E M Garzoacuten 2014 Performance evaluation of ker-nel fusion BLAS routines on the GPU iterative solvers as case study TheJournal of Supercomputing 70 2 (nov 2014) 577ndash587 httpsdoiorg101007s11227-014-1102-4

            [48] Y C Tay 2013 Analytical Performance Modeling for Computer Systems (2nded) Morgan amp Claypool Publishers

            [49] MB Taylor 2012 Is dark silicon useful harnessing the four horsemen of thecoming dark silicon apocalypse In Design Automation Conference (DAC) 201249th ACMEDACIEEE 1131ndash1136 httpsdoiorg10114522283602228567

            [50] G Venkatesh J Sampson N Goulding S Garcia V Bryksin J Lugo-MartinezS Swanson and M B Taylor 2010 Conservation cores Reducing the energyof mature computations In International Conference on Architectural Supportfor Programming Languages and Operating Systems - ASPLOS 205ndash218 httpsdoiorg10114517360201736044

            [51] Ganesh Venkatesh Jack Sampson Nathan Goulding-Hotta Sravanthi KotaVenkata Michael Bedford Taylor and Steven Swanson 2011 QsCores Trad-ing Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores InProceedings of the 44th Annual IEEEACM International Symposium on Microar-chitecture - MICRO-44 rsquo11 163 httpsdoiorg10114521556202155640

            [52] Guibin Wang Yisong Lin and Wei Yi 2010 Kernel Fusion An EffectiveMethod for Better Power Efficiency on Multithreaded GPU In Green Computingand Communications (GreenCom) 2010 IEEEACM Intrsquol Conference on CyberPhysical and Social Computing (CPSCom) 344ndash350 httpsdoiorg101109GreenCom-CPSCom2010102

            [53] Eric W Weisstein 2015 Newtonrsquos Method From MathWorld ndash A Wolfram WebResource httpmathworldwolframcomNewtonsMethodhtml

            [54] Samuel Williams Andrew Waterman and David Patterson 2009 Roofline aninsightful visual performance model for multicore architectures Commun ACM52 (2009) 65ndash76 httpsdoiorg10114514987651498785

            [55] Lisa Wu Andrea Lottarini Timothy K Paine Martha A Kim and Kenneth ARoss 2014 Q100 The Architecture and Design of a Database Processing UnitIn Proceedings of the 19th International Conference on Architectural Supportfor Programming Languages and Operating Systems (ASPLOS rsquo14) ACM NewYork NY USA 255ndash268 httpsdoiorg10114525419402541961

            [56] Moein Pahlavan Yali 2014 FPGA-Roofline An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems PhD DissertationVirginia Polytechnic Institute and State University

            [57] Yao Zhang and John D Owens 2011 A quantitative performance analysismodel for GPU architectures In Proceedings - International Symposium on High-Performance Computer Architecture 382ndash393 httpsdoiorg101109HPCA20115749745

            • Abstract
            • 1 Introduction
            • 2 The LogCA Model
              • 21 Effect of Granularity
              • 22 Performance Metrics
              • 23 Granularity dependent latency
                • 3 Applications of LogCA
                  • 31 Performance Bounds
                  • 32 Sensitivity Analysis
                    • 4 Experimental Methodology
                    • 5 Evaluation
                      • 51 Linear-Complexity Kernels (= 1)
                      • 52 Super-Linear Complexity Kernels (gt 1)
                      • 53 Sub-Linear Complexity Kernels (lt 1)
                      • 54 Case Studies
                        • 6 Related Work
                        • 7 Conclusion and Future Work
                        • References

              LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

              Table 2 Description of the Cryptographic accelerators

              Crypto Accelerator PCI Crypto UltraSPARC T2 SPARC T3 SPARC T4 Sandy BridgeProcessor AMD A8-3850 S2 S2 S3 Intel Core i7-2600Frequency 29 GHz 116 GHz 165 GHz 3 GHz 34 GHzOpenSSL version 098o 098o 098o 102 101k 098oKernel Ubuntu 3130-55 Oracle Solaris 11 Oracle Solaris 11 Oracle Solaris 112 Linux2632-504

              16 128 g1 1K 8K 64

              K51

              2K 4M 32M

              01

              1

              10

              100

              1000

              A

              CL

              oC AoCA

              oC

              A

              Granularity (Bytes)

              Sp

              eed

              up

              LogCA L110xo110x C10x A10x

              Figure 6 Optimization regions for UltraSPARC T2 The pres-ence of a parameter in an optimization region indicates thatit can at least provides 20 gains The horizontal arrow in-dicates the cut-off granularity at which a parameter provides20 gains

              parameter For example optimizing acceleration starts providingbenefits from 2KB while optimizing overheads or computationalindex is beneficial up till 32KB These arrows also indicate thecut-off granularity for each parameter These cut-off granularitiesprovide insights to architects and programmers about the designbottlenecks For example high cut-off granularity of 32KB suggestshigh overheads and thus a potential for optimization

              4 EXPERIMENTAL METHODOLOGYThis section describes the experimental setup and benchmarks forvalidating LogCA on real machines We also discuss our methodol-ogy for measuring LogCA parameters and performance metrics

              Our experimental setup comprises of on-chip and off-chip cryptoaccelerators (Table 2) and three different GPUs (Table 3) The on-chip crypto accelerators include cryptographic units on SunOracleUltraSPARC T2 [40] SPARC T3 [35] SPARC T4 [41] and AES-NI(AES New Instruction) [15] on Sandy Bridge whereas the off-chipaccelerator is a Hifn 7955 chip connected through the PCIe bus [43]The GPUs include a discrete NVIDIA GPU an integrated AMDGPU (APU) and HSA supported integrated GPU

              For the on-chip crypto accelerators each core in UltraSPARC T2and SPARC T3 has a physically addressed crypto unit which requiresprivileged DMA calls However the crypto unit on SPARC T4 isintegrated within the pipeline and does not require privileged DMAcalls SPARC T4 also provides non-privileged crypto instructions toaccess the crypto unit Similar to SPARC T4 sandy bridge providesnon-privileged crypto instructionmdashAESNI

              Considering the GPUs the discrete GPU is connected throughthe PCIe bus whereas for the APU the GPU is co-located with thehost processor on the same die For the APU the system memoryis partitioned between host and GPU memory This eliminates thePCIe bottleneck of data copying but it still requires copying databetween memories Unlike discrete GPU and APU HSA supportedGPU provides a unified and coherent view of the system memoryWith the host and GPU share the same virtual address space explicitcopying of data between memories is not required

              Our workloads consist of encryption hashing and GPU kernelsFor encryption and hashing we have used advanced encryptionstandard (AES) [30] and standard hashing algorithm (SHA) [31]respectively from OpenSSL [34]mdashan open source cryptography li-brary For GPU kernels we use matrix multiplication radix sortFFT and binary search from AMD OpenCL SDK [1] Table 4 we listthe complexities of each kernel both in terms of number of elementsn and granularity g We expect these complexities to remain same inboth cases but we observe that they differ for matrix multiplicationFor example for a square matrix of size n matrix multiplication hascomplexity of O (n3) whereas the complexity in terms of granularityis O (g17) This happens because for matrix multiplicationmdashunlikeothersmdashcomputations are performed on matrices and not vectorsSo offloading a square matrix of size n corresponds to offloading n2

              elements which results in the apparent discrepancy in the complexi-ties We also observe that for the granularity range of 16B to 32MBβ = 011 provides a close approximation for log(g)

              Table 3 Description of the GPUs

              Platform Discrete GPU Integrated APU AMD HSAName Tesla C2070 Radeon HD 6550 Radeon R7Architecture Fermi Beaver Creek KaveriCores 16 5 8Compute Units 448 400 512Clock Freq 15 GHz 600 MHz 720 MHzPeak FLOPS 1 T 480 G 856 GHostProcessor Intel AMD AMD

              Xeon E5520 A8-3850 A10-7850KFrequency GHz 227 29 17

              For calculating execution times we have used Linux utilities onthe crypto accelerators whereas for the GPUs we have used NVIDIAand AMD OpenCL profilers to compute the setup kernel and datatransfer times and we report the average of one hundred executionsFor verifying the usage of crypto accelerators we use built-in coun-ters in UltraSPARC T2 and T3 [46] SPARC T4 however no longer

              ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

              Table 4 Algorithmic complexity of various kernels with num-ber of elements and granularity The power of g represents β

              for each kernel

              Kernel Algorithmic ComplexityAdvanced Encryption Standard (AES) O (n) O (g101)Secure Hashing Algorithm (SHA) O (n) O (g097)Matrix Multiplication (GEMM) O (n3) O (g17)Fast Fourier Transform (FFT) O (n logn) O (g12)Radix Sort O (kn) O (g094)Binary Search O (logn) O (g014)

              Table 5 Calculated values of LogCA Parameters

              LogCA ParametersDevice Benchmark L o C A

              (cycles) (cycles) (cyclesB)

              Discrete GPU

              AES 174Radix Sort 290GEMM 3times103 2times108 2 30FFT 290Binary Search 116

              APU

              AES 174Radix Sort 290GEMM 15 4times108 2 7FFT 290Binary Search 116

              UltraSPARC T2 AES 1500 29times104 90 19SHA 105times103 72 12

              SPARC T3 AES 1500 27times104 90 12SHA 105times103 72 10

              SPARC T4 AES 500 435 32 12SHA 16times103 32 10

              SPARC T4 instr AES 4 111 32 12SHA 1638 32 10

              Sandy Bridge AES 3 10 35 6

              supports these counters so we use Linux utilities to trace the execu-tion of the crypto instructions [3] We use these execution times todetermine LogCA parameters We calculate these parameters onceand can be later used for different kernels on the same system

              For computational index and β we profile the CPU code on thehost by varying the granularity from 16B to 32MB At each granu-larity we measure the execution time and use regression analysisto determine C and β For overheads we use the observation thatfor very small granularities the execution time for a kernel on anaccelerator is dominated by the overheads ie limgrarr0 T1 (g) ≃ oFor acceleration we use different methods for the on-chip accelera-tors and GPUs For on-chip accelerators we calculate accelerationusing equation (3) and the observation that the speedup curve flat-tens out and approaches acceleration for very large granularitiesHowever for the GPUs we do not use equation (3) as it requirescomputing acceleration for each kernel as each application has adifferent access pattern which affects the speedup So we boundthe maximum performance using the peak flops from the devicespecifications We use the ratio of peak GFLOPs on CPU and GPUie A = Peak GFLOPGPU

              Peak GFLOPCPU Similar to acceleration we use two different

              techniques for calculating latency For the on-chip accelerators we

              run micro-benchmarks and use execution time on host and acceler-ators On the other hand for the GPUs we compute latency usingpeak memory bandwidth of the GPU Similar to Meswani et al [29]we use the following equation for measuring data copying time forthe GPUs L = 1

              BWpeak

              Earlier we develop our model using assumptions of granularityindependent and dependent latencies In our setup we observe thatthe on-chip crypto accelerators and HSA-enabled GPU representaccelerators with granularity independent latency while the off-chipcrypto accelerator and discrete GPUAPU represent the granular-ity dependent accelerators For each accelerator we calculate thespeedup and performance metrics using the respective equations(sect2)

              5 EVALUATIONIn this section we show that LogCA closely captures the behavior forboth off and on-chip accelerators We also list the calculate LogCAparameters in Table 5 To demonstrate the utility of our modelwe also present two case studies In these studies we consider theevolution of interface in SUNOraclersquos crypto accelerators and threedifferent GPU architectures In both cases we elaborate the designchanges using the insights LogCA provides

              51 Linear-Complexity Kernels (β = 1)Figure 7 shows the curve-fitting of LogCA for AES We considerboth off-chip and on-chip accelerators connected through differentinterfaces ranging from PCIe bus to special instructions We observethat the off-chip accelerators and APU unlike on-chip acceleratorsprovide reasonable speedup only at very large granularities We alsoobserve that the achievable speedup is limited by computationalintensity for off-chip accelerators and acceleration for on-chip accel-erators This observation supports earlier implication on the limitsof speedup for granularity independent and dependent latencies inequation (3) and (9) respectively

              Figure 7 also shows that UltraSPARC T2 provides higher speedupsthan Sandy Bridge but it breaks-even at a larger granularity SandyBridge on the other hand breaks-even at very small granularitybut provides limited speedup The discrete GPU with powerful pro-cessing cores has the highest acceleration among others Howeverits observed speedup is less than others due to high overheads andlatencies involved in communicating through the PCIe bus

              We have also marked g1 and g A2for each accelerator in Figure 7

              which help programmers and architects identify the complexity ofthe interface For example g1 for crypto instructions ie SPARCT4 and Sandy Bridge lies on the extreme left while for the off-chipaccelerators g1 lies on the far right It is worth mentioning that wehave marked g a

              2for on-chip accelerators but not for the off-chip

              accelerators For off-chip accelerators computational intensity isless than acceleration and as we have noted in equation (12) thatg A

              2for these designs does not existWe also observe that g1 for the crypto-card connected through

              the PCIe bus does not exist showing that this accelerator does notbreak-even even for large granularities Figure 7 also shows thatg1 for GPU and APU is comparable This observation shows thatdespite being an integrated GPU and not connected to the PCIe bus

              LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

              16 128 1K 8K 64

              K51

              2K 4M 32M

              01

              1

              10

              100

              A

              CL

              Sp

              eed

              up

              (a) PCIe crypto

              16 128 1K 8K 64

              K51

              2K 4M 32M

              01

              1

              10

              100A

              g1

              CL

              (b) NVIDIA Discrete GPU

              16 128 1K 8K 64

              K51

              2K 4M 32M

              01

              1

              10

              100

              A

              g1 gA2

              CL

              (c) AMD Integrated GPU (APU)

              16 128 1K 8K 64

              K51

              2K 4M 32M

              01

              1

              10

              100

              A

              g1 gA2

              CL

              (d) UltraSPARC T2

              16 128 1K 8K 64

              K51

              2K 4M 32M

              01

              1

              10

              100

              A

              g1 gA2

              CL

              Granularity (Bytes)

              Sp

              eed

              up

              (e) SPARC T3

              16 128 1K 8K 64

              K51

              2K 4M 32M

              01

              1

              10

              100

              A

              g1 gA2

              CL

              Granularity (Bytes)

              (f) SPARC T4 engine

              16 128 1K 8K 64

              K51

              2K 4M 32M

              01

              1

              10

              100

              A

              gA2

              CL

              g1 lt 16B

              Granularity (Bytes)

              (g) SPARC T4 instruction

              16 128 1K 8K 64

              K51

              2K 4M 32M

              01

              1

              10

              100

              A

              CL

              g1 gA2lt 16B

              Granularity (Bytes)

              (h) AESNI on Sandy Bridge

              observed LogCA

              Figure 7 Speedup curve fittings plots comparing LogCA with the observed values of AES [30]

[Figure 8: Speedup curve-fitting plots comparing LogCA with the observed values of SHA256 [31]; LogCA starts following the observed values after 64B. Panels: (a) UltraSPARC T2 engine, (b) SPARC T3 engine, (c) SPARC T4 engine, (d) SPARC T4 instruction. Each panel plots observed and LogCA speedup against granularity and marks A, C/L, g1, and gA/2; for the SPARC T4 instruction, g1 lies below 16B.]

Figure 8 shows the curve fitting for SHA on various on-chip crypto accelerators. We observe that g1 and gA/2 do exist, as all of these are on-chip accelerators. We also observe that the LogCA curve mostly follows the observed values. However, it deviates from the observed values below 64B. This happens because SHA requires a block size of 64B for hash computation: if the input is smaller than 64B, it is padded with extra bits to a full 64B block. Since LogCA does not capture this effect, it does not follow the observed speedup for granularities smaller than 64B.

Figure 9-a shows the speedup curve-fitting plots for Radix sort. We observe that LogCA does not follow the observed values at smaller granularities on the GPU. Despite this inaccuracy, LogCA accurately predicts g1 and gA/2. We also observe that gA/2 for the GPU is higher than for the APU, and this observation supports equation (7): increasing acceleration increases gA/2.

5.2 Super-Linear Complexity Kernels (β > 1)

Figures 9-b and 9-c show the speedup curve-fitting plots for super-linear kernels on the discrete GPU and the APU. We observe that matrix multiplication, with its higher complexity (O(g^1.7)), achieves higher speedup than sort and FFT, with their lower complexities of O(g) and O(g^1.2), respectively. This observation corroborates the result from equation (9) that the achievable speedup of higher-complexity algorithms exceeds that of lower-complexity algorithms. We also observe that gA/2 does not exist for FFT. This happens because, as we note in equation (12), for gA/2 to exist for FFT, C/L should be greater than A^(1/2). However, Figure 9-c shows that C/L is smaller than A^(1/2) for both the GPU and the APU.

5.3 Sub-Linear Complexity Kernels (β < 1)

Figure 9-d shows the curve fitting for binary search, a sub-linear algorithm (β = 0.14). We make three observations. First, g1 does not exist even for very large granularities, and C/L < 1. This observation supports implication (5) that a sub-linear algorithm with β = 0.14 requires C/L greater than 7 to provide any speedup.

[Figure 9: Speedup curve-fitting plots comparing LogCA with the observed values of (a) Radix Sort, (b) Matrix Multiplication, (c) FFT, and (d) Binary Search, each on the discrete GPU and the APU. Each panel plots observed and LogCA speedup against granularity and marks A, C/L, g1, and gA/2 where they exist.]

Second, at large granularities, speedup starts decreasing with increasing granularity. This observation supports our earlier claim in implication (4) that for systems with granularity-dependent latencies, the speedup of sub-linear algorithms asymptotically decreases. Third, LogCA deviates from the observed values at large granularities. This deviation occurs because LogCA does not model caches: as mentioned earlier, LogCA abstracts the caches and memories with a single latency parameter, which does not capture the memory-access pattern accurately. Even though LogCA does not accurately capture binary search's behavior, it still provides an upper bound on the achievable performance.
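The three complexity regimes can be reproduced with the same speedup expression. The sketch below (our own, using hypothetical granularity-dependent-latency parameters rather than the measured values of Table 5) shows that a super-linear kernel approaches A, a linear kernel saturates below it, and a sub-linear kernel never breaks even.

    def speedup(g, C, beta, o, L, A):
        """Granularity-dependent latency: data copying costs L per byte."""
        work = C * g ** beta
        return work / (o + L * g + work / A)

    # Hypothetical off-chip accelerator: large overhead, per-byte latency
    o, L, A = 1e6, 2.0, 30
    for beta in (0.14, 1.0, 1.7):                  # sub-linear, linear, super-linear
        best = max((speedup(2 ** k, 100, beta, o, L, A), 2 ** k)
                   for k in range(4, 26))          # sweep 16B .. 32MB
        print(f"beta={beta}: peak speedup {best[0]:.3g} at g={best[1]}")

With these numbers, β = 1.7 approaches the acceleration of 30, β = 1.0 saturates near C/(L + C/A), and β = 0.14 peaks far below 1 and then declines, mirroring the binary-search curve in Figure 9-d.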

5.4 Case Studies

Figure 10 shows the evolution of crypto accelerators in SPARC architectures, from the off-chip accelerators in pre-Niagara designs (Figure 10 (a)) to accelerators integrated within the pipeline in SPARC T4 (Figure 10 (e)). We observe that latency is absent from the on-chip accelerators' optimization regions, as these accelerators are closely coupled with the host. We also note that the optimization region with overheads, which represents the complexity of an accelerator's interface, shrinks, while the optimization regions with acceleration expand from Figure 10 (a) to (e). For example, the cut-off granularity for overheads is 256KB for the off-chip crypto accelerator but only 128B for SPARC T4, suggesting a much simpler interface.

Figure 10 (a) shows the optimization regions for the off-chip crypto accelerator connected through the PCIe bus. We note that overheads and latencies occupy most of the optimization regions, indicating that high-overhead OS calls and high-latency data copying over the PCIe bus are the bottlenecks.

Figure 10 (b) shows the optimization regions for UltraSPARC T2. The large cut-off granularity for overheads, at 32KB, suggests a complex interface, indicating that high-overhead OS calls create a bottleneck at small granularities. The cut-off granularity of 2KB for acceleration suggests that optimizing acceleration is beneficial at large granularities.

Figure 10 (d) shows the optimization regions for the on-chip accelerator on SPARC T4. There are three optimization regions, with the cut-off granularity for overheads now reduced to only 512B. This observation suggests a considerable improvement in the interface design over SPARC T3, which is also evident from a smaller g1. We also note that the cut-off granularity for acceleration now decreases to 32B, showing an increase in the opportunity for optimizing acceleration.

Figure 10 (e) shows the optimization regions for the crypto instructions on SPARC T4. We observe that, unlike the earlier designs, it has only two optimization regions, and the speedup approaches the peak acceleration at a small granularity of 128B. In contrast, UltraSPARC T2 and SPARC T3 do not provide any gains at this granularity. We also observe that the cut-off granularity for overheads further reduces to 128B, suggesting some opportunity for optimization at very small granularities. The model also shows that acceleration occupies the widest range for optimization; for example, optimizing acceleration provides benefits for granularities greater than 16B. The low-overhead access which LogCA shows is due to the non-privileged instruction SPARC T4 uses to access the cryptographic unit, which is integrated within the pipeline.

Figure 11 shows the evolution of memory interface design in GPU architectures. It shows the optimization regions for matrix multiplication on a discrete NVIDIA GPU, an AMD integrated GPU (APU), and an integrated AMD GPU with HSA support. We observe that matrix multiplication is compute bound on all three architectures (§3.1).

[Figure 10: LogCA optimization regions for performing the Advanced Encryption Standard on various crypto accelerators: (a) PCIe crypto accelerator, (b) UltraSPARC T2, (c) SPARC T3, (d) SPARC T4 engine, (e) SPARC T4 instruction. Each panel plots speedup against granularity and shades the regions where improving L (1/10x), o (1/10x), C (10x), or A (10x) pays off. LogCA identifies the design bottlenecks through its parameters in an optimization region; the bottleneck LogCA points to in each design is optimized in the next design.]

[Figure 11: Optimization regions for matrix multiplication over a range of granularities on (a) an NVIDIA discrete GPU, (b) an AMD APU, and (c) an HSA-supported AMD integrated GPU. Each panel plots speedup against granularity and shades the regions where improving L (1/10x), o (1/10x), C (10x), or A (10x) pays off.]

We also observe that the computational index occupies most of the regions, which signifies maximum optimization potential.

The discrete GPU has four optimization regions (Figure 11 (a)). Among these, latency dominates most of the regions, signifying high-latency data copying over the PCIe bus and thus maximum optimization potential. The high cut-off granularity for overheads, at 32KB, indicates high-overhead OS calls to access the GPU. Similarly, with highly aggressive cores, acceleration has a high cut-off granularity of 256KB, indicating less optimization potential for acceleration.

Similar to the discrete GPU, the APU also has four optimization regions (Figure 11 (b)), with a few notable differences. The cut-off granularity for latency reduces to 512KB with the elimination of data copying over the PCIe bus; the overheads are still high, suggesting high-overhead OS calls to access the APU; and with less aggressive cores, the cut-off granularity for acceleration reduces to 64KB, implying more optimization potential for acceleration.

Figure 11 (c) shows three optimization regions for the HSA-enabled integrated GPU. We observe that latency is absent from all regions and that the cut-off granularity for overheads reduces to 8KB. These reductions in overheads and latencies signify a simpler interface compared to the discrete GPU and the APU. We also observe that the cut-off granularity for acceleration drops to 2KB, suggesting higher potential for optimizing acceleration.
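The optimization regions in Figures 10 and 11 come from LogCA's sensitivity analysis (§3.2): a parameter belongs to a region at a given granularity if improving it by 10x (L and o reduced, C and A raised, per the figure legends) yields a worthwhile speedup gain. A toy re-creation, with an illustrative threshold and hypothetical PCIe-like parameters of our own choosing, might look like this:

    def speedup(g, C, beta, o, L, A):
        work = C * g ** beta
        return work / (o + L * g + work / A)   # granularity-dependent latency

    def region(g, p, threshold=1.2):
        """Parameters whose 10x improvement lifts speedup by >= threshold."""
        base = speedup(g, **p)
        tweaks = {"o": dict(p, o=p["o"] / 10),  # cheaper dispatch
                  "L": dict(p, L=p["L"] / 10),  # faster link
                  "C": dict(p, C=p["C"] * 10),  # more work per byte
                  "A": dict(p, A=p["A"] * 10)}  # stronger accelerator
        return [name for name, q in tweaks.items()
                if speedup(g, **q) / base >= threshold]

    # Hypothetical PCIe-attached design: OS-call overhead and bus latency
    p = dict(C=2.0, beta=1.0, o=2e8, L=2.0, A=30)
    for g in (256, 64 * 2 ** 10, 32 * 2 ** 20):
        print(f"g={g}: optimize {region(g, p)}")

With these numbers, overheads and the computational index dominate at small granularities and latency joins them at large ones, echoing the PCIe panels of Figures 10 (a) and 11 (a); acceleration never clears the threshold because the interface, not the accelerator, is the bottleneck.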

6 RELATED WORK

We compare and contrast our work with prior approaches. Lopez-Novoa et al. [28] provide a detailed survey of accelerator modeling techniques. We broadly classify these techniques into two categories and discuss the most relevant work.

Analytical Models: There is a rich body of work exploring analytical models for performance prediction of accelerators. For some models, the motivation is to determine future trends in heterogeneous architectures. Chung et al. [7], in a detailed study, predict the future landscape of heterogeneous computing. Hempstead et al. [17] propose an early-stage model, Navigo, that determines the fraction of area required for accelerators to maintain the traditional performance trend. Nilakantan et al. [32] propose incorporating communication cost into early-stage models of accelerator-rich architectures. For others, the motivation is to determine the right amount of data to offload and the potential benefits associated with an accelerator [24].

Some analytical models are architecture specific. For example, a number of studies [20, 21, 44, 57] predict the performance of GPU architectures. Hong et al. [20] present an analytical performance model for predicting execution time on GPUs; they later extend this model into an integrated power and performance model for GPUs [21]. Song et al. [44] use a simple counter-based approach to predict power and performance. Meswani et al. [29] explore such models for high-performance applications. Daga et al. [11] analyze the effectiveness of accelerated processing units (APUs) over GPUs and identify the communication cost over the PCIe bus as a major bottleneck in exploiting the full potential of GPUs.

In general, our work differs from these studies in complexity. These models use a large number of parameters to accurately predict power and/or performance, whereas we deliberately limit the number of parameters to keep our model simple. They also require a deep understanding of the underlying architecture, and most require access to GPU-specific assembly or PTX code. Unlike these approaches, we use CPU code to provide bounds on the achievable performance.

Roofline Models: In terms of simplicity and motivation, our work closely matches the Roofline model [54], a visual performance model for multi-core architectures. Roofline exposes the bottlenecks for a kernel and suggests optimizations which programmers can use to fine-tune the kernel on a given system.

A number of extensions of Roofline have been proposed [10, 22, 33, 56], some of which are architecture specific, targeting, for example, GPUs [22], vector processors [39], and FPGAs [10, 56].

Despite the similarities, Roofline and its extensions cannot be used to expose design bottlenecks in an accelerator's interface. The primary goal of Roofline models has been to help programmers and compiler writers, while LogCA provides more insights for architects.
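To make the contrast concrete, a schematic comparison (our own sketch with made-up peak numbers, not code from either paper): Roofline bounds a kernel's attainable throughput by its arithmetic intensity, while LogCA bounds an offloaded kernel's speedup by its interface costs.

    def roofline_bound(intensity_flops_per_byte, peak_gflops, peak_gbps):
        """Roofline [54]: attainable GFLOP/s is the lower of the compute
        roof and the bandwidth-scaled intensity."""
        return min(peak_gflops, peak_gbps * intensity_flops_per_byte)

    def logca_speedup(g, C, beta, o, L, A):
        """LogCA: speedup of offloading g bytes, gated by o and L."""
        work = C * g ** beta
        return work / (o + L * g + work / A)

    print(roofline_bound(4.0, 1000.0, 100.0))              # 400: bandwidth-bound
    print(logca_speedup(2 ** 20, 2.0, 1.0, 2e8, 2.0, 30))  # interface-bound

The first bound never mentions offload granularity, overhead, or link latency, which is exactly the information an architect needs when the question is the accelerator's interface rather than the kernel's compute-to-memory balance.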

7 CONCLUSION AND FUTURE WORK

With the recent trend towards heterogeneous computing, we feel that the architecture community lacks a model to reason about the need for accelerators. To this end, we propose LogCA, an insightful visual performance model for hardware accelerators. LogCA provides insights to both architects and programmers early in the design stage: it identifies performance bounds, exposes interface design bottlenecks, and suggests optimizations to alleviate these bottlenecks. We have validated our model across a range of on-chip and off-chip accelerators and have shown its utility through retrospective studies describing the evolution of the accelerator interface in these architectures.

The applicability of LogCA can be limited by our simplifying assumptions, and for more realistic analysis we plan to overcome these limitations in future work. For example, we assume a single-accelerator system and do not explicitly model contention among resources; our model should handle multi-accelerator and pipelined scenarios. For fixed-function accelerators, our design space is currently limited to encryption and hashing kernels; to overcome this, we are expanding our design space with the compression and database accelerators in the Oracle M7 processor. We also plan to complement LogCA with an energy model, as energy efficiency is a prime design metric for accelerators.

ACKNOWLEDGEMENTS

We thank our anonymous reviewers, Arkaprava Basu, and Tony Nowatzki for their insightful comments and feedback on the paper. Thanks to Mark Hill, Michael Swift, the Wisconsin Computer Architecture Affiliates, and other members of the Multifacet group for their valuable discussions. We also thank Brian Wilson at University of Wisconsin DoIT and Eric Sedlar at Oracle Labs for providing access to SPARC T3 and T4 servers, respectively. Also thanks to Muhammad Umair Bin Altaf for his help in the formulation. This work is supported in part by the National Science Foundation (CNS-1302260, CCF-1438992, CCF-1533885, CCF-1617824), Google, and the University of Wisconsin-Madison (Amar and Balindar Sohi Professorship in Computer Science). Wood has a significant financial interest in AMD and Google.

REFERENCES

[1] Advanced Micro Devices. 2016. APP SDK: A Complete Development Platform. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk
[2] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS '67 (Spring)), 483. https://doi.org/10.1145/1465482.1465560
[3] Dan Anderson. 2012. How to tell if SPARC T4 crypto is being used. https://blogs.oracle.com/DanX/entry/how_to_tell_if_sparc
[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. 2009. A View of the Parallel Computing Landscape. Commun. ACM 52, 10 (Oct 2009), 56-67. https://doi.org/10.1145/1562764.1562783
[5] Nathan Beckmann and Daniel Sanchez. 2016. Cache Calculus: Modeling Caches through Differential Equations. IEEE Computer Architecture Letters (2016). https://doi.org/10.1109/LCA.2015.2512873
[6] C. Cascaval, S. Chatterjee, H. Franke, K. J. Gildea, and P. Pattnaik. 2010. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development 54 (2010), 5:1-5:10. https://doi.org/10.1147/JRD.2010.2059721
[7] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. 2010. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (Dec 2010), 225-236. https://doi.org/10.1109/MICRO.2010.36
[8] Jason Cong, Zhenman Fang, Michael Gill, and Glenn Reinman. 2015. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. In 2015 IEEE/ACM International Conference on Computer-Aided Design. Austin, TX.
[9] D. Culler, R. Karp, D. Patterson, and A. Sahay. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1-12. http://dl.acm.org/citation.cfm?id=155333
[10] Bruno da Silva, An Braeken, Erik H. D'Hollander, and Abdellah Touhafi. 2013. Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools. Int. J. Reconfig. Comput. 2013 (Jan 2013). https://doi.org/10.1155/2013/428078
[11] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. 2011. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing. IEEE, 141-149. https://doi.org/10.1109/SAAHPC.2011.29
[12] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA '11). ACM, New York, NY, USA, 365. https://doi.org/10.1145/2000064.2000108
[13] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. 2010. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development 54 (2010), 3:1-3:11. https://doi.org/10.1147/JRD.2009.2036980
[14] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings of the International Symposium on High-Performance Computer Architecture, 503-514. https://doi.org/10.1109/HPCA.2011.5749755
[15] Shay Gueron. 2012. Intel Advanced Encryption Standard (AES) Instructions Set. Technical Report. Intel Corporation. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[16] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10), 37. https://doi.org/10.1145/1815961.1815968
[17] Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. Navigo: An early-stage model to study power-constrained architectures and specialization. In ISCA Workshop on Modeling, Benchmarking and Simulations (MoBS).
[18] John L. Hennessy and David A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach.
[19] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (Jul 2008), 33-38. https://doi.org/10.1109/MC.2008.209
[20] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, 152-163. https://doi.org/10.1145/1555815.1555775
[21] Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, 280-289. https://doi.org/10.1145/1816038.1815998
[22] Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par '12). Springer-Verlag, Berlin, Heidelberg, 920-932. https://doi.org/10.1007/978-3-642-32820-6_90
[23] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 468-479. https://doi.org/10.1145/2540708.2540748
[24] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. 2013. A survey of computation offloading for mobile systems. Mobile Networks and Applications 18 (2013), 129-140. https://doi.org/10.1007/s11036-012-0368-0
[25] Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 361-372. https://doi.org/10.1145/2751205.2751231
[26] Maysam Lavasani, Hari Angepat, and Derek Chiou. 2014. An FPGA-based In-Line Accelerator for Memcached. IEEE Comput. Archit. Lett. 13, 2 (Jul 2014), 57-60. https://doi.org/10.1109/L-CA.2013.17
[27] John D. C. Little and Stephen C. Graves. 2008. Little's law. In Building Intuition. Springer, 81-100.
[28] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso. 2015. A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. IEEE Transactions on Parallel and Distributed Systems 26, 1 (Jan 2015), 272-281. https://doi.org/10.1109/TPDS.2014.2308216
[29] M. R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole. 2013. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications 27 (2013), 89-108. https://doi.org/10.1177/1094342012468180
[30] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). https://doi.org/10.6028/NIST.FIPS.197
[31] National Institute of Standards and Technology. 2008. Secure Hash Standard. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
[32] S. Nilakantan, S. Battle, and M. Hempstead. 2013. Metrics for Early-Stage Modeling of Many-Accelerator Architectures. IEEE Computer Architecture Letters 12, 1 (Jan 2013), 25-28. https://doi.org/10.1109/L-CA.2012.9
[33] Cedric Nugteren and Henk Corporaal. 2012. The Boat Hull Model: Enabling Performance Prediction for Parallel Computing Prior to Code Development. In Proceedings of the 9th Conference on Computing Frontiers. ACM, 203-212.
[34] OpenSSL Software Foundation. 2015. OpenSSL: Cryptography and SSL/TLS Toolkit. https://openssl.org
[35] Sanjay Patel. 2009. Sun's Next-Generation Multithreaded Processor: Rainbow Falls. In 21st Hot Chips Symposium.
[36] Sanjay Patel and Wen-mei W. Hwu. 2008. Accelerator Architectures. IEEE Micro 28, 4 (Jul 2008), 4-12. https://doi.org/10.1109/MM.2008.50
[37] Stephen Phillips. 2014. M7: Next Generation SPARC. In 26th Hot Chips Symposium.
[38] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.
[39] Yoshiei Sato, Ryuichi Nagaoka, Akihiro Musa, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Hiroaki Kobayashi. 2009. Performance tuning and analysis of future vector processors based on the roofline model. In Proceedings of the 10th MEDEA Workshop on MEmory performance: DEaling with Applications, systems and architecture (MEDEA '09), 7. https://doi.org/10.1145/1621960.1621962
[40] M. Shah, J. Barren, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. 2007. UltraSPARC T2: A highly-threaded, power-efficient SPARC SOC. In IEEE Asian Solid-State Circuits Conference (A-SSCC '07), 22-25. https://doi.org/10.1109/ASSCC.2007.4425786
[41] Manish Shah, Robert Golla, Gregory Grohoski, Paul Jordan, Jama Barreh, Jeffrey Brooks, Mark Greenberg, Gideon Levinsky, Mark Luttrell, Christopher Olson, Zeid Samoail, Matt Smittle, and Thomas Ziaja. 2012. Sparc T4: A dynamically threaded server-on-a-chip. IEEE Micro 32 (2012), 8-19. https://doi.org/10.1109/MM.2012.1
[42] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In International Symposium on Computer Architecture (ISCA).
[43] Soekris Engineering. 2016. vpn1401 for Std. PCI-sockets. http://soekris.com/products/vpn-1401.html
[44] Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In Proceedings of the IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS 2013), 673-686. https://doi.org/10.1109/IPDPS.2013.73
[45] Jeff Stuecheli. 2013. POWER8. In 25th Hot Chips Symposium.
[46] Ning Sun and Chi-Chang Lin. 2007. Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors. Technical Report. http://www.oracle.com/technetwork/server-storage/solaris/documentation/819-5782-150147.pdf
[47] S. Tabik, G. Ortega, and E. M. Garzón. 2014. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. The Journal of Supercomputing 70, 2 (Nov 2014), 577-587. https://doi.org/10.1007/s11227-014-1102-4
[48] Y. C. Tay. 2013. Analytical Performance Modeling for Computer Systems (2nd ed.). Morgan & Claypool Publishers.
[49] M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC) 2012, 1131-1136. https://doi.org/10.1145/2228360.2228567
[50] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 205-218. https://doi.org/10.1145/1736020.1736044
[51] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44), 163. https://doi.org/10.1145/2155620.2155640
[52] Guibin Wang, Yisong Lin, and Wei Yi. 2010. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on Cyber, Physical and Social Computing (CPSCom), 344-350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102
[53] Eric W. Weisstein. 2015. Newton's Method. From MathWorld, A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html
[54] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52 (2009), 65-76. https://doi.org/10.1145/1498765.1498785
[55] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 255-268. https://doi.org/10.1145/2541940.2541961
[56] Moein Pahlavan Yali. 2014. FPGA-Roofline: An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems. PhD Dissertation. Virginia Polytechnic Institute and State University.
[57] Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings of the International Symposium on High-Performance Computer Architecture, 382-393. https://doi.org/10.1109/HPCA.2011.5749745

                ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

                Table 4 Algorithmic complexity of various kernels with num-ber of elements and granularity The power of g represents β

                for each kernel

                Kernel Algorithmic ComplexityAdvanced Encryption Standard (AES) O (n) O (g101)Secure Hashing Algorithm (SHA) O (n) O (g097)Matrix Multiplication (GEMM) O (n3) O (g17)Fast Fourier Transform (FFT) O (n logn) O (g12)Radix Sort O (kn) O (g094)Binary Search O (logn) O (g014)

                Table 5 Calculated values of LogCA Parameters

                LogCA ParametersDevice Benchmark L o C A

                (cycles) (cycles) (cyclesB)

                Discrete GPU

                AES 174Radix Sort 290GEMM 3times103 2times108 2 30FFT 290Binary Search 116

                APU

                AES 174Radix Sort 290GEMM 15 4times108 2 7FFT 290Binary Search 116

                UltraSPARC T2 AES 1500 29times104 90 19SHA 105times103 72 12

                SPARC T3 AES 1500 27times104 90 12SHA 105times103 72 10

                SPARC T4 AES 500 435 32 12SHA 16times103 32 10

                SPARC T4 instr AES 4 111 32 12SHA 1638 32 10

                Sandy Bridge AES 3 10 35 6

                supports these counters so we use Linux utilities to trace the execu-tion of the crypto instructions [3] We use these execution times todetermine LogCA parameters We calculate these parameters onceand can be later used for different kernels on the same system

                For computational index and β we profile the CPU code on thehost by varying the granularity from 16B to 32MB At each granu-larity we measure the execution time and use regression analysisto determine C and β For overheads we use the observation thatfor very small granularities the execution time for a kernel on anaccelerator is dominated by the overheads ie limgrarr0 T1 (g) ≃ oFor acceleration we use different methods for the on-chip accelera-tors and GPUs For on-chip accelerators we calculate accelerationusing equation (3) and the observation that the speedup curve flat-tens out and approaches acceleration for very large granularitiesHowever for the GPUs we do not use equation (3) as it requirescomputing acceleration for each kernel as each application has adifferent access pattern which affects the speedup So we boundthe maximum performance using the peak flops from the devicespecifications We use the ratio of peak GFLOPs on CPU and GPUie A = Peak GFLOPGPU

                Peak GFLOPCPU Similar to acceleration we use two different

                techniques for calculating latency For the on-chip accelerators we

                run micro-benchmarks and use execution time on host and acceler-ators On the other hand for the GPUs we compute latency usingpeak memory bandwidth of the GPU Similar to Meswani et al [29]we use the following equation for measuring data copying time forthe GPUs L = 1

                BWpeak

                Earlier we develop our model using assumptions of granularityindependent and dependent latencies In our setup we observe thatthe on-chip crypto accelerators and HSA-enabled GPU representaccelerators with granularity independent latency while the off-chipcrypto accelerator and discrete GPUAPU represent the granular-ity dependent accelerators For each accelerator we calculate thespeedup and performance metrics using the respective equations(sect2)

                5 EVALUATIONIn this section we show that LogCA closely captures the behavior forboth off and on-chip accelerators We also list the calculate LogCAparameters in Table 5 To demonstrate the utility of our modelwe also present two case studies In these studies we consider theevolution of interface in SUNOraclersquos crypto accelerators and threedifferent GPU architectures In both cases we elaborate the designchanges using the insights LogCA provides

                51 Linear-Complexity Kernels (β = 1)Figure 7 shows the curve-fitting of LogCA for AES We considerboth off-chip and on-chip accelerators connected through differentinterfaces ranging from PCIe bus to special instructions We observethat the off-chip accelerators and APU unlike on-chip acceleratorsprovide reasonable speedup only at very large granularities We alsoobserve that the achievable speedup is limited by computationalintensity for off-chip accelerators and acceleration for on-chip accel-erators This observation supports earlier implication on the limitsof speedup for granularity independent and dependent latencies inequation (3) and (9) respectively

                Figure 7 also shows that UltraSPARC T2 provides higher speedupsthan Sandy Bridge but it breaks-even at a larger granularity SandyBridge on the other hand breaks-even at very small granularitybut provides limited speedup The discrete GPU with powerful pro-cessing cores has the highest acceleration among others Howeverits observed speedup is less than others due to high overheads andlatencies involved in communicating through the PCIe bus

                We have also marked g1 and g A2for each accelerator in Figure 7

                which help programmers and architects identify the complexity ofthe interface For example g1 for crypto instructions ie SPARCT4 and Sandy Bridge lies on the extreme left while for the off-chipaccelerators g1 lies on the far right It is worth mentioning that wehave marked g a

                2for on-chip accelerators but not for the off-chip

                accelerators For off-chip accelerators computational intensity isless than acceleration and as we have noted in equation (12) thatg A

                2for these designs does not existWe also observe that g1 for the crypto-card connected through

                the PCIe bus does not exist showing that this accelerator does notbreak-even even for large granularities Figure 7 also shows thatg1 for GPU and APU is comparable This observation shows thatdespite being an integrated GPU and not connected to the PCIe bus

                LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                A

                CL

                Sp

                eed

                up

                (a) PCIe crypto

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100A

                g1

                CL

                (b) NVIDIA Discrete GPU

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                A

                g1 gA2

                CL

                (c) AMD Integrated GPU (APU)

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                A

                g1 gA2

                CL

                (d) UltraSPARC T2

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                A

                g1 gA2

                CL

                Granularity (Bytes)

                Sp

                eed

                up

                (e) SPARC T3

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                A

                g1 gA2

                CL

                Granularity (Bytes)

                (f) SPARC T4 engine

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                A

                gA2

                CL

                g1 lt 16B

                Granularity (Bytes)

                (g) SPARC T4 instruction

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                A

                CL

                g1 gA2lt 16B

                Granularity (Bytes)

                (h) AESNI on Sandy Bridge

                observed LogCA

                Figure 7 Speedup curve fittings plots comparing LogCA with the observed values of AES [30]

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                A

                g1 gA2

                CL

                Granularity (Bytes)

                Sp

                eed

                up

                (a) UltraSPARC T2 engine

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                A

                g1 gA2

                CL

                Granularity (Bytes)

                (b) SPARC T3 engine

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                A

                g1 gA2

                CL

                Granularity (Bytes)

                (c) SPARC T4 engine

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                A

                gA2

                g1 lt 16B

                CL

                Granularity (Bytes)

                (d) SPARC T4 instruction

                observed LogCA

                Figure 8 Speedup curve fittings plots comparing LogCA with the observed values of SHA256 [31] LogCA starts following observedvalues after 64B

                APU spends considerable time in copying data from the host todevice memory

                Figure 8 shows the curve fitting for SHA on various on-chipcrypto accelerators We observe that g1 and g A

                2do exist as all of

                these are on-chip accelerators We also observe that the LogCAcurve mostly follows the observed value However it deviates fromthe observed value before 64B This happens because SHA requiresblock size of 64B for hash computation If the block size is less than64B it pads extra bits to make the block size 64B Since LogCAdoes not capture this effect it does not follow the observed speedupfor granularity smaller than 64B

                Figure 9-a shows the speedup curve fitting plots for Radix sortWe observe that LogCA does not follow observed values for smallergranularities on GPU Despite this inaccuracy LogCA accuratelypredicts g1 and g A

                2 We also observe that g A

                2for GPU is higher than

                APU and this observation supports equation (7) that increasingacceleration increases g A

                2

                52 Super-Linear Complexity Kernels (β gt 1)Figures 9-b and 9-c show the speedup curve fitting plots for super-complexity kernels on discrete GPU and APU We observe that ma-trix multiplication with higher complexity (O (g17)) achieves higherspeedup than sort and FFT with lower complexities of O (g) andO (g12) respectively This observation corroborates results fromequation (9) that achievable speedup of higher-complexity algo-rithms is higher than lower-complexity algorithms We also observethat g A

                2does not exist for FFT This happens because as we note in

                equation (12) that for g A2to exist for FFT C

                L should be greater thanA

                12 However Figure 9-c shows that CL is smaller than A

                12 for bothGPU and APU

                53 Sub-Linear Complexity Kernels (β lt 1)Figure 9-d shows the curve fitting for binary search which is asub-linear algorithm (β = 014) We make three observations Firstg1 does not exist even for very large granularities and C

                L lt 1 Thisobservation supports implication (5) that for a sub-linear algorithm

                ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                A

                g1 gA2

                CL

                Sp

                eed

                up

                GPU

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                A

                g1 gA2

                CL

                APU

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                A

                g1 gA2

                GPU

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                A

                g1 gA2

                CL

                APU

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100A

                g1

                CL

                Granularity (Bytes)

                Sp

                eed

                up

                GPU

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                A

                g1

                CL

                Granularity (Bytes)

                APU

                16 128 1K 8K 64

                K51

                2K 4M 32M

                001

                01

                1

                10

                100A

                g1 gA2

                CL

                Granularity (Bytes)

                GPU

                16 128 1K 8K 64

                K51

                2K 4M 32M

                001

                01

                1

                10

                100

                A

                g1 gA2

                CL

                Granularity (Bytes)

                APU

                (a) Radix Sort (b) Matrix Multiplication

                (c) FFT (d) Binary Search

                observed LogCA

                Figure 9 Speedup curve fittings plots comparing LogCA with the observed values of (a) Radix Sort (b) Matrix Multiplication (c) FFTand (d) Binary Search

                of β = 014 CL should be greater than 7 to provide any speedup

                Second for large granularities speedup starts decreasing with anincrease in granularity This observation supports our earlier claimin implication (4) that for systems with granularity dependent la-tencies speedup for sub-linear algorithms asymptotically decreasesThird LogCA deviates from the observed value at large granularitiesThis deviation occurs because LogCA does not model caches Asmentioned earlier LogCA abstracts the caches and memories witha single parameter of latency which does not capture the memory-access pattern accurately Even though LogCA does not accuratelycaptures binary search behavior it still provides an upper bound onthe achievable performance

                54 Case StudiesFigure 10 shows the evolution of crypto accelerators in SPARCarchitectures from the off-chip accelerators in pre-Niagara (Figure 10(a)) to accelerators integrated within the pipeline in SPARC T4(Figure 10 (e)) We observe that latency is absent in the on-chipacceleratorsrsquo optimization regions as these accelerators are closelycoupled with the host We also note that the optimization regionwith overheadsmdashrepresenting the complexity of an acceleratorrsquosinterfacemdashshrinks while the optimization regions with accelerationexpand from Figure 10 (a-e) For example for the off-chip cryptoaccelerator the cut-off granularity for overheads is 256KB whereasit is 128B for the SPARC T4 suggesting a much simpler interface

                Figure 10 (a) shows the optimization regions for the off-chipcrypto accelerator connected through the PCIe bus We note thatoverheads and latencies occupy most of the optimization regionsindicating high overhead OS calls and high-latency data copyingover the PCIe bus as the bottlenecks

                Figure 10 (b) shows the optimization regions for UltraSPARCT2 The large cut-off granularity for overheads at 32KB suggestsa complex interface indicating high overhead OS call creating abottleneck at small granularities The cut-off granularity of 2KB foracceleration suggests that optimizing acceleration is beneficial atlarge granularities

                Figure 10 (d) shows optimization regions for on-chip acceleratoron SPARC T4 There are three optimization regions with the cut-offgranularity for overhead now reduced to only 512B This observationsuggests a considerable improvement in the interface design overSPARC T3 and it is also evident by a smaller g1 We also note thatcut-off granularity for acceleration now decreases to 32B showingan increase in the opportunity for optimizing acceleration

                Figure 10 (e) shows optimization regions for crypto instructionson SPARC T4 We observe that unlike earlier designs it has only twooptimization regions and the speedup approaches the peak accelera-tion at a small granularity of 128B In contrast UltraSPARC T2 andSPARC T3 do not even provide any gains at this granularity We alsoobserve that the cut-off granularity for overheads further reduces to128B suggesting some opportunity for optimization at very smallgranularities The model also shows that the acceleration occupiesthe maximum range for optimization For example optimizing accel-eration provides benefits for granularities greater than 16B The lowoverhead access which LogCA shows is due to the non-privilegedinstruction SPARC T4 uses to access the cryptographic unit whichis integrated within the pipeline

                Figure 11 shows the evolution of memory interface design inGPU architectures It shows the optimization regions for matrixmultiplication on a discrete NVIDIA GPU an AMD integrated GPU(APU) and an integrated AMD GPU with HSA support We observethat matrix multiplication for all three architectures is compute bound

                LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                1000

                A

                CL

                oCLoC

                LC

                oL

                Granularity (Bytes)

                Sp

                eed

                up

                (a) PCIe Crypto Accelerator

                16 128 g1 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                1000

                A

                CL

                oC AoCA

                oA

                Granularity (Bytes)

                (b) UltraSPARC T2

                16 128 g1 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                1000CL

                A

                oC AoCA

                oA

                Granularity (Bytes)

                (c) SPARC T3

                g112

                8 1K 8K 64K

                512K 4M 32

                M

                01

                1

                10

                100

                1000

                A

                oCA A

                CL

                oA

                Granularity (Bytes)

                Sp

                eed

                up

                (d) SPARC T4 engine

                16 128 1K 8K 64

                K51

                2K 4M 32M

                01

                1

                10

                100

                1000

                A

                oCA A

                CL

                oA

                Granularity (Bytes)

                (e) SPARC T4 instruction

                LogCA L110xo110x C10x A10x

                Figure 10 LogCA for performing Advanced Encryption Standard on various crypto accelerators LogCA identifies the design bottle-necks through LogCA parameters in an optimization region The bottlenecks which LogCA suggests in each design is optimized inthe next design

[Figure 11 plots: speedup vs. granularity (bytes), 16B to 32MB, for (a) NVIDIA Discrete GPU, (b) AMD Integrated GPU (APU), and (c) HSA supported AMD Integrated GPU, each annotated with its optimization regions and curves for LogCA, L 1/10x, o 1/10x, C 10x, A 10x.]

Figure 11: Various optimization regions for matrix multiplication over a range of granularities on (a) NVIDIA discrete GPU, (b) AMD APU, and (c) HSA supported GPU.


The discrete GPU has four optimization regions (Figure 11 (a)). Among these, latency dominates most of the regions, signifying high-latency data copying over the PCIe bus and thus maximum optimization potential. The high cut-off granularity for overheads, at 32KB, indicates high-overhead OS calls to access the GPU. Similarly, with highly aggressive cores, acceleration has a high cut-off granularity of 256KB, indicating less optimization potential for acceleration.

Similar to the discrete GPU, the APU also has four optimization regions (Figure 11 (b)). There are a few notable differences compared to the discrete GPU: the cut-off granularity for latency reduces to 512KB with the elimination of data copying over the PCIe bus; the overheads are still high, suggesting high-overhead OS calls to access the APU; and with less aggressive cores, the cut-off granularity for acceleration reduces to 64KB, implying more optimization potential for acceleration.

Figure 11 (c) shows three optimization regions for the HSA-enabled integrated GPU. We observe that latency is absent in all regions and the cut-off granularity for overhead reduces to 8KB. These reductions in overheads and latencies signify a simpler interface as compared to the discrete GPU and APU. We also observe that the cut-off granularity for acceleration drops to 2KB, suggesting higher potential for optimizing acceleration.
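The optimization regions in Figures 10 and 11 can be read as the granularity ranges over which improving a particular parameter still pays off. The sketch below mimics that bookkeeping under stated assumptions: at each granularity it scales one parameter at a time by 10x in its favorable direction (o and L down, C and A up), mirroring the figure legends, and keeps the parameters whose scaling still yields a meaningful speedup gain. The 1.2x gain threshold and the discrete-GPU-like parameter values are assumptions for illustration, not the paper's exact region criterion.

    def speedup(g, o, L, C, A, beta=1.0):
        # LogCA speedup at granularity g (bytes), granularity-dependent latency.
        return (C * g**beta) / (o + L * g + (C * g**beta) / A)

    def region_labels(g, o, L, C, A, beta=1.0, gain=1.2):
        # Scale each parameter 10x in its favorable direction and report the
        # ones that still improve speedup by at least `gain` at this granularity.
        base = speedup(g, o, L, C, A, beta)
        variants = {
            "o": (o / 10, L, C, A),
            "L": (o, L / 10, C, A),
            "C": (o, L, C * 10, A),
            "A": (o, L, C, A * 10),
        }
        return [p for p, (vo, vL, vC, vA) in variants.items()
                if speedup(g, vo, vL, vC, vA, beta) >= gain * base]

    # Hypothetical discrete-GPU-like parameters: large overhead o (OS calls) and
    # a per-byte latency term L (PCIe copies); illustrative values only.
    o, L, C, A = 1e6, 1.0, 10.0, 1000.0
    for g in (16, 1024, 32 * 1024, 256 * 1024, 32 * 1024 * 1024):
        print(f"g = {g:>9} B -> optimize {region_labels(g, o, L, C, A)}")

Under these assumed values the labels shift from {o, C} at small granularities toward {L, C} at large ones, qualitatively matching the latency-dominated regions LogCA reports for the discrete GPU.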

6 RELATED WORK

We compare and contrast our work with prior approaches. Lopez-Novoa et al. [28] provide a detailed survey of various accelerator modeling techniques. We broadly classify these techniques into two categories and discuss the most relevant work.

Analytical Models. There is a rich body of work exploring analytical models for performance prediction of accelerators. For some models, the motivation is to determine the future trend in heterogeneous architectures. Chung et al. [7], in a detailed study, predict the future landscape of heterogeneous computing. Hempstead et al. [17] propose an early-stage model, Navigo, that determines the fraction of area required for accelerators to maintain the traditional performance trend. Nilakantan et al. [32] propose incorporating communication cost in early-stage models of accelerator-rich architectures. For others, the motivation is to determine the right amount of data to offload and the potential benefits associated with an accelerator [24].

Some analytical models are architecture specific. For example, a number of studies [20, 21, 44, 57] predict the performance of GPU architectures. Hong et al. [20] present an analytical performance model for predicting execution time on GPUs. They later extend their model and develop an integrated power and performance model for GPUs [21]. Song et al. [44] use a simple counter-based approach to predict power and performance. Meswani et al. [29] explore such models for high-performance applications. Daga et al. [11] analyze the effectiveness of accelerated processing units (APUs) over GPUs and describe the communication cost over the PCIe bus as a major bottleneck in exploiting the full potential of GPUs.

In general, our work differs from these studies in its complexity. These models use a large number of parameters to accurately predict power and/or performance, whereas we limit the number of parameters to reduce the complexity of our model. They also require a deep understanding of the underlying architecture, and most require access to GPU-specific assembly or PTX code. Unlike these approaches, we use CPU code to provide bounds on the performance.

Roofline Models. In terms of simplicity and motivation, our work closely matches the Roofline model [54], a visual performance model for multi-core architectures. Roofline exposes bottlenecks for a kernel and suggests several optimizations which programmers can use to fine-tune the kernel on a given system.

A number of extensions of Roofline have been proposed [10, 22, 33, 56], and some of these extensions are architecture specific, for example targeting GPUs [22], vector processors [39], and FPGAs [10, 56].

Despite the similarities, Roofline and its extensions cannot be used for exposing design bottlenecks in an accelerator's interface. The primary goal of Roofline models has been to help programmers and compiler writers, while LogCA provides more insights for architects.

7 CONCLUSION AND FUTURE WORK

With the recent trend towards heterogeneous computing, we feel that the architecture community lacks a model to reason about the need for accelerators. In this respect, we propose LogCA, an insightful visual performance model for hardware accelerators. LogCA provides insights early in the design stage to both architects and programmers: it identifies performance bounds, exposes interface design bottlenecks, and suggests optimizations to alleviate these bottlenecks. We have validated our model across a range of on-chip and off-chip accelerators and have shown its utility using retrospective studies describing the evolution of the accelerator interface in these architectures.

The applicability of LogCA can be limited by our simplifying assumptions, and for more realistic analysis we plan to overcome these limitations in future work. For example, we assume a single-accelerator system and do not explicitly model contention among resources; our model should be extended to handle multi-accelerator and pipelined scenarios. For fixed-function accelerators, our design space is currently limited to encryption and hashing kernels; to overcome this, we are expanding our design space with compression and database accelerators in the Oracle M7 processor. We also plan to complement LogCA with an energy model, as energy efficiency is a prime design metric for accelerators.

ACKNOWLEDGEMENTS

We thank our anonymous reviewers, Arkaprava Basu, and Tony Nowatzki for their insightful comments and feedback on the paper. Thanks to Mark Hill, Michael Swift, the Wisconsin Computer Architecture Affiliates, and other members of the Multifacet group for their valuable discussions. We also thank Brian Wilson at University of Wisconsin DoIT and Eric Sedlar at Oracle Labs for providing access to SPARC T3 and T4 servers, respectively. Also thanks to Muhammad Umair Bin Altaf for his help in the formulation. This work is supported in part by the National Science Foundation (CNS-1302260, CCF-1438992, CCF-1533885, CCF-1617824), Google, and the University of Wisconsin-Madison (Amar and Balindar Sohi Professorship in Computer Science). Wood has a significant financial interest in AMD and Google.

REFERENCES
[1] Advanced Micro Devices. 2016. APP SDK - A Complete Development Platform. Advanced Micro Devices. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk
[2] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS '67 (Spring)), 483. https://doi.org/10.1145/1465482.1465560
[3] Dan Anderson. 2012. How to tell if SPARC T4 crypto is being used. https://blogs.oracle.com/DanX/entry/how_to_tell_if_sparc
[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. 2009. A View of the Parallel Computing Landscape. Commun. ACM 52, 10 (Oct. 2009), 56-67. https://doi.org/10.1145/1562764.1562783
[5] Nathan Beckmann and Daniel Sanchez. 2016. Cache Calculus: Modeling Caches through Differential Equations. Computer Architecture Letters PP, 99 (2016), 1. https://doi.org/10.1109/LCA.2015.2512873
[6] C. Cascaval, S. Chatterjee, H. Franke, K. J. Gildea, and P. Pattnaik. 2010. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development 54 (2010), 5:1-5:10. https://doi.org/10.1147/JRD.2010.2059721
[7] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. 2010. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (Dec. 2010), 225-236. https://doi.org/10.1109/MICRO.2010.36
[8] Jason Cong, Zhenman Fang, Michael Gill, and Glenn Reinman. 2015. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. In 2015 IEEE/ACM International Conference on Computer-Aided Design, Austin, TX.
[9] D. Culler, R. Karp, D. Patterson, and A. Sahay. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1-12. http://dl.acm.org/citation.cfm?id=155333
[10] Bruno da Silva, An Braeken, Erik H. D'Hollander, and Abdellah Touhafi. 2013. Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools. Int. J. Reconfig. Comput. 2013 (Jan. 2013), 77-77. https://doi.org/10.1155/2013/428078
[11] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. 2011. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing. IEEE, 141-149. https://doi.org/10.1109/SAAHPC.2011.29
[12] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceeding of the 38th Annual International Symposium on Computer Architecture (ISCA '11). ACM Press, New York, NY, USA, 365. https://doi.org/10.1145/2000064.2000108
[13] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. 2010. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development 54 (2010), 3:1-3:11. https://doi.org/10.1147/JRD.2009.2036980
[14] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings - International Symposium on High-Performance Computer Architecture, 503-514. https://doi.org/10.1109/HPCA.2011.5749755
[15] Shay Gueron. 2012. Intel Advanced Encryption Standard (AES) Instructions Set. Technical Report. Intel Corporation. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[16] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10), 37. https://doi.org/10.1145/1815961.1815968
[17] Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. Navigo: An early-stage model to study power-constrained architectures and specialization. In ISCA Workshop on Modeling, Benchmarking, and Simulations (MoBS).
[18] John L. Hennessy and David A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach. 704 pages.
[19] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (July 2008), 33-38. https://doi.org/10.1109/MC.2008.209
[20] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, Vol. 37, 152-163. https://doi.org/10.1145/1555815.1555775
[21] Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, Vol. 38, 280-289. https://doi.org/10.1145/1816038.1815998
[22] Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par '12). Springer-Verlag, Berlin, Heidelberg, 920-932. https://doi.org/10.1007/978-3-642-32820-6_90
[23] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 468-479. https://doi.org/10.1145/2540708.2540748
[24] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. 2013. A survey of computation offloading for mobile systems. Mobile Networks and Applications 18 (2013), 129-140. https://doi.org/10.1007/s11036-012-0368-0
[25] Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 361-372. https://doi.org/10.1145/2751205.2751231
[26] Maysam Lavasani, Hari Angepat, and Derek Chiou. 2014. An FPGA-based In-Line Accelerator for Memcached. IEEE Comput. Archit. Lett. 13, 2 (July 2014), 57-60. https://doi.org/10.1109/L-CA.2013.17
[27] John D. C. Little and Stephen C. Graves. 2008. Little's law. In Building Intuition. Springer, 81-100.
[28] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso. 2015. A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. IEEE Transactions on Parallel and Distributed Systems 26, 1 (Jan. 2015), 272-281. https://doi.org/10.1109/TPDS.2014.2308216
[29] M. R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole. 2013. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications 27 (2013), 89-108. https://doi.org/10.1177/1094342012468180
[30] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). https://doi.org/10.6028/NIST.FIPS.197
[31] National Institute of Standards and Technology. 2008. Secure Hash Standard. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
[32] S. Nilakantan, S. Battle, and M. Hempstead. 2013. Metrics for Early-Stage Modeling of Many-Accelerator Architectures. Computer Architecture Letters 12, 1 (Jan. 2013), 25-28. https://doi.org/10.1109/L-CA.2012.9
[33] Cedric Nugteren and Henk Corporaal. 2012. The Boat Hull Model: Enabling Performance Prediction for Parallel Computing Prior to Code Development. In Proceedings of the 9th Conference on Computing Frontiers. ACM, 203-212.
[34] OpenSSL Software Foundation. 2015. OpenSSL: Cryptography and SSL/TLS Toolkit. https://openssl.org
[35] Sanjay Patel. 2009. Sun's Next-Generation Multithreaded Processor: Rainbow Falls. In 21st Hot Chips Symposium. http://www.hotchips.org/wp-content/uploads/hc
[36] Sanjay Patel and Wen-mei W. Hwu. 2008. Accelerator Architectures. IEEE Micro 28, 4 (July 2008), 4-12. https://doi.org/10.1109/MM.2008.50
[37] Stephen Phillips. 2014. M7: Next Generation SPARC. In 26th Hot Chips Symposium.
[38] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.
[39] Yoshiei Sato, Ryuichi Nagaoka, Akihiro Musa, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Hiroaki Kobayashi. 2009. Performance tuning and analysis of future vector processors based on the roofline model. In Proceedings of the 10th MEDEA Workshop on MEmory performance: DEaling with Applications, systems and architecture (MEDEA '09), 7. https://doi.org/10.1145/1621960.1621962
[40] M. Shah, J. Barreh, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. 2007. UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC. In Solid-State Circuits Conference, 2007 (ASSCC '07), IEEE Asian, 22-25. https://doi.org/10.1109/ASSCC.2007.4425786
[41] Manish Shah, Robert Golla, Gregory Grohoski, Paul Jordan, Jama Barreh, Jeffrey Brooks, Mark Greenberg, Gideon Levinsky, Mark Luttrell, Christopher Olson, Zeid Samoail, Matt Smittle, and Thomas Ziaja. 2012. Sparc T4: A dynamically threaded server-on-a-chip. IEEE Micro 32 (2012), 8-19. https://doi.org/10.1109/MM.2012.1
[42] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In International Symposium on Computer Architecture (ISCA).
[43] Soekris Engineering. 2016. vpn1401 for Std. PCI-sockets. Soekris Engineering. http://soekris.com/products/vpn-1401.html
[44] Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS 2013), 673-686. https://doi.org/10.1109/IPDPS.2013.73
[45] Jeff Stuecheli. 2013. POWER8. In 25th Hot Chips Symposium.
[46] Ning Sun and Chi-Chang Lin. 2007. Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors. Technical Report. http://www.oracle.com/technetwork/server-storage/solaris/documentation/819-5782-150147.pdf
[47] S. Tabik, G. Ortega, and E. M. Garzón. 2014. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. The Journal of Supercomputing 70, 2 (Nov. 2014), 577-587. https://doi.org/10.1007/s11227-014-1102-4
[48] Y. C. Tay. 2013. Analytical Performance Modeling for Computer Systems (2nd ed.). Morgan & Claypool Publishers.
[49] M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, 1131-1136. https://doi.org/10.1145/2228360.2228567
[50] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 205-218. https://doi.org/10.1145/1736020.1736044
[51] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44), 163. https://doi.org/10.1145/2155620.2155640
[52] Guibin Wang, Yisong Lin, and Wei Yi. 2010. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on Cyber, Physical and Social Computing (CPSCom), 344-350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102
[53] Eric W. Weisstein. 2015. Newton's Method. From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html
[54] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52 (2009), 65-76. https://doi.org/10.1145/1498765.1498785
[55] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 255-268. https://doi.org/10.1145/2541940.2541961
[56] Moein Pahlavan Yali. 2014. FPGA-Roofline: An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems. PhD Dissertation. Virginia Polytechnic Institute and State University.
[57] Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings - International Symposium on High-Performance Computer Architecture, 382-393. https://doi.org/10.1109/HPCA.2011.5749745


                  [57] Yao Zhang and John D Owens 2011 A quantitative performance analysismodel for GPU architectures In Proceedings - International Symposium on High-Performance Computer Architecture 382ndash393 httpsdoiorg101109HPCA20115749745

                  • Abstract
                  • 1 Introduction
                  • 2 The LogCA Model
                    • 21 Effect of Granularity
                    • 22 Performance Metrics
                    • 23 Granularity dependent latency
                      • 3 Applications of LogCA
                        • 31 Performance Bounds
                        • 32 Sensitivity Analysis
                          • 4 Experimental Methodology
                          • 5 Evaluation
                            • 51 Linear-Complexity Kernels (= 1)
                            • 52 Super-Linear Complexity Kernels (gt 1)
                            • 53 Sub-Linear Complexity Kernels (lt 1)
                            • 54 Case Studies
                              • 6 Related Work
                              • 7 Conclusion and Future Work
                              • References


[Figure 9: Speedup curve-fitting plots comparing LogCA with the observed values of (a) Radix Sort, (b) Matrix Multiplication, (c) FFT, and (d) Binary Search. Each panel plots speedup versus granularity (bytes) for the GPU and APU, annotating the acceleration A and the cut-off granularities g1 and gA/2; legend: observed vs. LogCA.]

of β = 0.14, C/L should be greater than 7 to provide any speedup.

Second, for large granularities, speedup starts decreasing with an increase in granularity. This observation supports our earlier claim in implication (4) that for systems with granularity-dependent latencies, speedup for sub-linear algorithms asymptotically decreases. Third, LogCA deviates from the observed values at large granularities. This deviation occurs because LogCA does not model caches. As mentioned earlier, LogCA abstracts the caches and memories with a single latency parameter, which does not capture the memory-access pattern accurately. Even though LogCA does not accurately capture binary search behavior, it still provides an upper bound on the achievable performance.
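To make this implication concrete, consider a minimal sketch of the model's speedup under granularity-dependent latency (using the notation of Section 2, and folding all interface overheads into a single term o):

\[ \text{Speedup}(g) \;=\; \frac{C\,g^{\beta}}{o + L\,g + C\,g^{\beta}/A} \;\approx\; \frac{C}{L}\,g^{\beta-1} \quad \text{for large } g \text{ when } \beta < 1, \]

since the linear latency term L·g eventually dominates both o and C·g^β. With β − 1 < 0, the bound decays toward zero, matching the downward tail of Figure 9 (d).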

5.4 Case Studies

Figure 10 shows the evolution of crypto accelerators in SPARC architectures, from the off-chip accelerators in pre-Niagara (Figure 10 (a)) to the accelerators integrated within the pipeline in SPARC T4 (Figure 10 (e)). We observe that latency is absent in the on-chip accelerators' optimization regions, as these accelerators are closely coupled with the host. We also note that the optimization region with overheads, representing the complexity of an accelerator's interface, shrinks while the optimization regions with acceleration expand from Figure 10 (a-e). For example, for the off-chip crypto accelerator the cut-off granularity for overheads is 256KB, whereas it is 128B for the SPARC T4, suggesting a much simpler interface.
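These cut-off granularities follow directly from the model: g1 is the break-even granularity at which speedup reaches 1. As a rough illustration (not the paper's implementation), the sketch below assumes the granularity-dependent-latency speedup form given above, with purely hypothetical parameter values, and solves for g1 numerically; bisection is used to keep the sketch dependency-free, though Newton's method [53] would converge faster.

# A minimal sketch, assuming
#   Speedup(g) = C*g**beta / (o + L*g + C*g**beta/A)
# with hypothetical parameters, not measured values.

def speedup(g, C, beta, o, L, A):
    return (C * g ** beta) / (o + L * g + (C * g ** beta) / A)

def g1(C, beta, o, L, A, lo=1.0, hi=2.0 ** 25, iters=100):
    # Bisection for the smallest g with speedup(g) >= 1, assuming
    # speedup is increasing over [lo, hi] and crosses 1 inside it.
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if speedup(mid, C, beta, o, L, A) < 1.0:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical off-chip accelerator: high overhead o dominates at
# small granularities, so break-even comes late (tens of KB).
print(g1(C=0.1, beta=1.0, o=5e3, L=1e-2, A=50.0))  # ~5.7e4 bytes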

Figure 10 (a) shows the optimization regions for the off-chip crypto accelerator connected through the PCIe bus. We note that overheads and latencies occupy most of the optimization regions, indicating high-overhead OS calls and high-latency data copying over the PCIe bus as the bottlenecks.

Figure 10 (b) shows the optimization regions for UltraSPARC T2. The large cut-off granularity for overheads at 32KB suggests a complex interface, indicating that high-overhead OS calls create a bottleneck at small granularities. The cut-off granularity of 2KB for acceleration suggests that optimizing acceleration is beneficial at large granularities.

Figure 10 (d) shows the optimization regions for the on-chip accelerator on SPARC T4. There are three optimization regions, with the cut-off granularity for overheads now reduced to only 512B. This observation suggests a considerable improvement in the interface design over SPARC T3, as is also evident from a smaller g1. We also note that the cut-off granularity for acceleration now decreases to 32B, showing an increase in the opportunity for optimizing acceleration.

Figure 10 (e) shows the optimization regions for crypto instructions on SPARC T4. We observe that, unlike earlier designs, it has only two optimization regions, and the speedup approaches the peak acceleration at a small granularity of 128B. In contrast, UltraSPARC T2 and SPARC T3 do not even provide any gains at this granularity. We also observe that the cut-off granularity for overheads further reduces to 128B, suggesting some opportunity for optimization at very small granularities. The model also shows that acceleration occupies the maximum range for optimization; for example, optimizing acceleration provides benefits for granularities greater than 16B. The low-overhead access which LogCA shows is due to the non-privileged instruction SPARC T4 uses to access the cryptographic unit, which is integrated within the pipeline.

Figure 11 shows the evolution of memory interface design in GPU architectures. It shows the optimization regions for matrix multiplication on a discrete NVIDIA GPU, an AMD integrated GPU (APU), and an integrated AMD GPU with HSA support. We observe that matrix multiplication for all three architectures is compute bound


[Figure 10: LogCA for performing the Advanced Encryption Standard on various crypto accelerators: (a) PCIe crypto accelerator, (b) UltraSPARC T2, (c) SPARC T3, (d) SPARC T4 engine, and (e) SPARC T4 instruction. Each panel plots speedup versus granularity (bytes), with optimization regions marked by the LogCA parameters (o, C, L, A); legend: LogCA, with L/10x, o/10x, C 10x, A 10x. LogCA identifies the design bottlenecks through its parameters in an optimization region; the bottleneck which LogCA suggests in each design is optimized in the next design.]

[Figure 11: Optimization regions for matrix multiplication over a range of granularities on (a) NVIDIA discrete GPU, (b) AMD integrated GPU (APU), and (c) HSA-supported AMD integrated GPU. Each panel plots speedup versus granularity (bytes); legend as in Figure 10.]

(§3.1). We also observe that the computational index occupies most of the regions, which signifies maximum optimization potential.

The discrete GPU has four optimization regions (Figure 11 (a)). Among these, latency dominates most of the regions, signifying high-latency data copying over the PCIe bus and thus maximum optimization potential. The high cut-off granularity for overheads at 32KB indicates high-overhead OS calls to access the GPU. Similarly, with highly aggressive cores, acceleration has a high cut-off granularity of 256KB, indicating less optimization potential for acceleration.

Similar to the discrete GPU, the APU also has four optimization regions (Figure 11 (b)). There are a few notable differences as compared to the discrete GPU. The cut-off granularity for latency


reduces to 512KB with the elimination of data copying over the PCIe bus; the overheads are still high, suggesting high-overhead OS calls to access the APU; and, with less aggressive cores, the cut-off granularity for acceleration reduces to 64KB, implying more optimization potential for acceleration.

Figure 11 (c) shows three optimization regions for the HSA-enabled integrated GPU. We observe that latency is absent in all regions and the cut-off granularity for overheads reduces to 8KB. These reductions in overheads and latencies signify a simpler interface as compared to the discrete GPU and APU. We also observe that the cut-off granularity for acceleration drops to 2KB, suggesting higher potential for optimizing acceleration.
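These optimization regions are the product of the sensitivity analysis of §3.2: at a given granularity, a parameter belongs to the region when improving it by 10x (L/10, o/10, C×10, A×10, as in the figure legends) yields an appreciable gain in speedup. The sketch below illustrates that classification, reusing the assumed speedup form and hypothetical parameters from the earlier sketch; it is not the paper's implementation, and the 1.2x threshold is an arbitrary illustrative choice.

# Classify which LogCA parameters matter at each granularity, in the
# spirit of the optimization regions of Figs. 10-11. A parameter joins
# the region if a 10x improvement raises speedup beyond a threshold.
# All parameter values are hypothetical.

def speedup(g, C, beta, o, L, A):
    return (C * g ** beta) / (o + L * g + (C * g ** beta) / A)

def region(g, C, beta, o, L, A, threshold=1.2):
    base = speedup(g, C, beta, o, L, A)
    tweaks = {
        "o": dict(o=o / 10),   # 10x lower overhead
        "L": dict(L=L / 10),   # 10x lower latency
        "C": dict(C=C * 10),   # 10x higher computational index
        "A": dict(A=A * 10),   # 10x higher acceleration
    }
    members = []
    for name, kw in tweaks.items():
        args = dict(C=C, beta=beta, o=o, L=L, A=A)
        args.update(kw)
        if speedup(g, **args) / base > threshold:
            members.append(name)
    return members

for g in [16, 1024, 64 * 1024, 4 * 1024 * 1024]:
    print(g, region(g, C=0.1, beta=1.0, o=5e3, L=1e-2, A=50.0))

Sweeping g across the x-axis range of Figure 11 and recording the returned sets reproduces region plots of this general shape.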

6 RELATED WORK

We compare and contrast our work with prior approaches. Lopez-Novoa et al. [28] provide a detailed survey of various accelerator modeling techniques. We broadly classify these techniques into two categories and discuss the most relevant work.

Analytical Models. There is a rich body of work exploring analytical models for performance prediction of accelerators. For some models, the motivation is to determine the future trend in heterogeneous architectures. Chung et al. [7], in a detailed study, predict the future landscape of heterogeneous computing. Hempstead et al. [17] propose an early-stage model, Navigo, that determines the fraction of area required for accelerators to maintain the traditional performance trend. Nilakantan et al. [32] propose to incorporate communication cost in early-stage models of accelerator-rich architectures. For others, the motivation is to determine the right amount of data to offload and the potential benefits associated with an accelerator [24].

Some analytical models are architecture specific. For example, a number of studies [20, 21, 44, 57] predict the performance of GPU architectures. Hong et al. [20] present an analytical performance model for predicting execution time on GPUs. They later extend their model and develop an integrated power and performance model for GPUs [21]. Song et al. [44] use a simple counter-based approach to predict power and performance. Meswani et al. [29] explore such models for high-performance applications. Daga et al. [11] analyze the effectiveness of accelerated processing units (APUs) over GPUs and describe the communication cost over the PCIe bus as a major bottleneck in exploiting the full potential of GPUs.

In general, our work differs from these studies in complexity. These models use a large number of parameters to accurately predict power and/or performance, whereas we limit the number of parameters to reduce the complexity of our model. They also require a deep understanding of the underlying architecture. Most of these models also require access to GPU-specific assembly or PTX code. Unlike these approaches, we use CPU code to provide bounds on the performance.
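As a concrete illustration of this last point, the host-side parameters C and β can be fitted from CPU timings alone, for example with a log-log least-squares fit of t(g) = C·g^β. The sketch below is hypothetical (the synthetic ts list stands in for real host measurements) and is not the paper's measurement methodology.

import math

def fit_c_beta(gs, ts):
    """Least-squares fit of t = C * g**beta in log-log space.
    gs: granularities (bytes); ts: measured host execution times."""
    xs = [math.log(g) for g in gs]
    ys = [math.log(t) for t in ts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)
    c = math.exp(my - beta * mx)
    return c, beta

# Synthetic measurements of a near-linear kernel (stand-in for timings):
gs = [16, 128, 1024, 8192]
ts = [0.9e-6 * g ** 1.05 for g in gs]
print(fit_c_beta(gs, ts))  # ~ (0.9e-6, 1.05)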

Roofline Models. In terms of simplicity and motivation, our work closely matches the Roofline model [54], a visual performance model for multicore architectures. Roofline exposes bottlenecks for a kernel and suggests several optimizations which programmers can use to fine-tune the kernel on a given system.

A number of extensions of Roofline have been proposed [10, 22, 33, 56], and some of these extensions are architecture specific, for example, targeting GPUs [22], vector processors [39], and FPGAs [10, 56].

Despite the similarities, Roofline and its extensions cannot be used for exposing design bottlenecks in an accelerator's interface. The primary goal of Roofline models has been to help programmers and compiler writers, while LogCA provides more insights for architects.

7 CONCLUSION AND FUTURE WORK

With the recent trend towards heterogeneous computing, we feel that the architecture community lacks a model to reason about the need for accelerators. In this respect, we propose LogCA, an insightful visual performance model for hardware accelerators. LogCA provides insights early in the design stage to both architects and programmers: it identifies performance bounds, exposes interface design bottlenecks, and suggests optimizations to alleviate these bottlenecks. We have validated our model across a range of on-chip and off-chip accelerators and have shown its utility using retrospective studies describing the evolution of the accelerator interface in these architectures.

The applicability of LogCA can be limited by our simplifying assumptions, and for more realistic analysis we plan to overcome these limitations in future work. For example, we assume a single-accelerator system and do not explicitly model contention among resources; our model should handle multi-accelerator and pipelined scenarios. For fixed-function accelerators, our design space is currently limited to encryption and hashing kernels. To overcome this, we are expanding our design space with compression and database accelerators in the Oracle M7 processor. We also plan to complement LogCA with an energy model, as energy efficiency is a prime design metric for accelerators.

ACKNOWLEDGEMENTS

We thank our anonymous reviewers, Arkaprava Basu, and Tony Nowatzki for their insightful comments and feedback on the paper. Thanks to Mark Hill, Michael Swift, Wisconsin Computer Architecture Affiliates, and other members of the Multifacet group for their valuable discussions. We also thank Brian Wilson at University of Wisconsin DoIT and Eric Sedlar at Oracle Labs for providing access to SPARC T3 and T4 servers, respectively. Also thanks to Muhammad Umair Bin Altaf for his help in the formulation. This work is supported in part by the National Science Foundation (CNS-1302260, CCF-1438992, CCF-1533885, CCF-1617824), Google, and the University of Wisconsin-Madison (Amar and Balindar Sohi Professorship in Computer Science). Wood has a significant financial interest in AMD and Google.

REFERENCES
[1] Advanced Micro Devices. 2016. APP SDK - A Complete Development Platform. Advanced Micro Devices. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk
[2] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference - AFIPS '67 (Spring) (1967), 483. https://doi.org/10.1145/1465482.1465560
[3] Dan Anderson. 2012. How to tell if SPARC T4 crypto is being used. https://blogs.oracle.com/DanX/entry/how_to_tell_if_sparc
[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. 2009. A View of the Parallel Computing Landscape. Commun. ACM 52, 10 (Oct. 2009), 56–67. https://doi.org/10.1145/1562764.1562783
[5] Nathan Beckmann and Daniel Sanchez. 2016. Cache Calculus: Modeling Caches through Differential Equations. Computer Architecture Letters PP, 99 (2016), 1. https://doi.org/10.1109/LCA.2015.2512873
[6] C. Cascaval, S. Chatterjee, H. Franke, K. J. Gildea, and P. Pattnaik. 2010. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development 54 (2010), 5:1–5:10. https://doi.org/10.1147/JRD.2010.2059721
[7] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. 2010. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (Dec. 2010), 225–236. https://doi.org/10.1109/MICRO.2010.36
[8] Jason Cong, Zhenman Fang, Michael Gill, and Glenn Reinman. 2015. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. In 2015 IEEE/ACM International Conference on Computer-Aided Design. Austin, TX.
[9] D. Culler, R. Karp, D. Patterson, and A. Sahay. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 1–12. http://dl.acm.org/citation.cfm?id=155333
[10] Bruno da Silva, An Braeken, Erik H. D'Hollander, and Abdellah Touhafi. 2013. Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools. Int. J. Reconfig. Comput. 2013 (Jan. 2013). https://doi.org/10.1155/2013/428078
[11] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. 2011. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing. IEEE, 141–149. https://doi.org/10.1109/SAAHPC.2011.29
[12] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceeding of the 38th annual international symposium on Computer architecture - ISCA '11. ACM Press, New York, New York, USA, 365. https://doi.org/10.1145/2000064.2000108
[13] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. 2010. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development 54 (2010), 3:1–3:11. https://doi.org/10.1147/JRD.2009.2036980
[14] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings - International Symposium on High-Performance Computer Architecture. 503–514. https://doi.org/10.1109/HPCA.2011.5749755
[15] Shay Gueron. 2012. Intel Advanced Encryption Standard (AES) Instructions Set. Technical Report. Intel Corporation. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[16] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. Proceedings of the 37th annual international symposium on Computer architecture - ISCA '10 (2010), 37. https://doi.org/10.1145/1815961.1815968
[17] Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. Navigo: An early-stage model to study power-constrained architectures and specialization. In ISCA Workshop on Modeling, Benchmarking, and Simulations (MoBS).
[18] John L. Hennessy and David A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach. 704 pages.
[19] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (July 2008), 33–38. https://doi.org/10.1109/MC.2008.209
[20] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, Vol. 37. 152–163. https://doi.org/10.1145/1555815.1555775
[21] Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, Vol. 38. 280–289. https://doi.org/10.1145/1816038.1815998
[22] Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par'12). Springer-Verlag, Berlin, Heidelberg, 920–932. https://doi.org/10.1007/978-3-642-32820-6_90
[23] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 468–479. https://doi.org/10.1145/2540708.2540748
[24] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. 2013. A survey of computation offloading for mobile systems. Mobile Networks and Applications 18 (2013), 129–140. https://doi.org/10.1007/s11036-012-0368-0
[25] Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 361–372. https://doi.org/10.1145/2751205.2751231
[26] Maysam Lavasani, Hari Angepat, and Derek Chiou. 2014. An FPGA-based In-Line Accelerator for Memcached. IEEE Comput. Archit. Lett. 13, 2 (July 2014), 57–60. https://doi.org/10.1109/L-CA.2013.17
[27] John D. C. Little and Stephen C. Graves. 2008. Little's law. In Building intuition. Springer, 81–100.
[28] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso. 2015. A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. Parallel and Distributed Systems, IEEE Transactions on 26, 1 (Jan. 2015), 272–281. https://doi.org/10.1109/TPDS.2014.2308216
[29] M. R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole. 2013. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications 27 (2013), 89–108. https://doi.org/10.1177/1094342012468180
[30] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). https://doi.org/10.6028/NIST.FIPS.197
[31] National Institute of Standards and Technology. 2008. Secure Hash Standard. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
[32] S. Nilakantan, S. Battle, and M. Hempstead. 2013. Metrics for Early-Stage Modeling of Many-Accelerator Architectures. Computer Architecture Letters 12, 1 (Jan. 2013), 25–28. https://doi.org/10.1109/L-CA.2012.9
[33] Cedric Nugteren and Henk Corporaal. 2012. The Boat Hull Model: Enabling Performance Prediction for Parallel Computing Prior to Code Development. In Proceedings of the 9th Conference on Computing Frontiers. ACM, 203–212.
[34] OpenSSL Software Foundation. 2015. OpenSSL: Cryptography and SSL/TLS Toolkit. https://openssl.org
[35] Sanjay Patel. 2009. Sun's Next-Generation Multithreaded Processor: Rainbow Falls. In 21st Hot Chip Symposium. http://www.hotchips.org/wp-content/uploads/hc
[36] Sanjay Patel and Wen-mei W. Hwu. 2008. Accelerator Architectures. IEEE Micro 28, 4 (July 2008), 4–12. https://doi.org/10.1109/MM.2008.50
[37] Stephen Phillips. 2014. M7: Next Generation SPARC. In 26th Hot Chip Symposium.
[38] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.
[39] Yoshiei Sato, Ryuichi Nagaoka, Akihiro Musa, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Hiroaki Kobayashi. 2009. Performance tuning and analysis of future vector processors based on the roofline model. Proceedings of the 10th MEDEA workshop on MEmory performance: DEaling with Applications, systems and architecture - MEDEA '09 (2009), 7. https://doi.org/10.1145/1621960.1621962
[40] M. Shah, J. Barreh, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. 2007. UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC. In Solid-State Circuits Conference, 2007. ASSCC '07. IEEE Asian. 22–25. https://doi.org/10.1109/ASSCC.2007.4425786
[41] Manish Shah, Robert Golla, Gregory Grohoski, Paul Jordan, Jama Barreh, Jeffrey Brooks, Mark Greenberg, Gideon Levinsky, Mark Luttrell, Christopher Olson, Zeid Samoail, Matt Smittle, and Thomas Ziaja. 2012. Sparc T4: A dynamically threaded server-on-a-chip. IEEE Micro 32 (2012), 8–19. https://doi.org/10.1109/MM.2012.1
[42] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In International Symposium on Computer Architecture (ISCA).
[43] Soekris Engineering. 2016. vpn 1401 for Std. PCI-sockets. http://soekris.com/products/vpn-1401.html
[44] Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium, IPDPS 2013. 673–686. https://doi.org/10.1109/IPDPS.2013.73
[45] Jeff Stuecheli. 2013. POWER8. In 25th Hot Chip Symposium.
[46] Ning Sun and Chi-Chang Lin. 2007. Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 processors. Technical Report. http://www.oracle.com/technetwork/server-storage/solaris/documentation/819-5782-150147.pdf
[47] S. Tabik, G. Ortega, and E. M. Garzón. 2014. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. The Journal of Supercomputing 70, 2 (Nov. 2014), 577–587. https://doi.org/10.1007/s11227-014-1102-4
[48] Y. C. Tay. 2013. Analytical Performance Modeling for Computer Systems (2nd ed.). Morgan & Claypool Publishers.
[49] M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE. 1131–1136. https://doi.org/10.1145/2228360.2228567
[50] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS. 205–218. https://doi.org/10.1145/1736020.1736044
[51] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture - MICRO-44 '11. 163. https://doi.org/10.1145/2155620.2155640
[52] Guibin Wang, Yisong Lin, and Wei Yi. 2010. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on Cyber, Physical and Social Computing (CPSCom). 344–350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102
[53] Eric W. Weisstein. 2015. Newton's Method. From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html
[54] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52 (2009), 65–76. https://doi.org/10.1145/1498765.1498785
[55] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 255–268. https://doi.org/10.1145/2541940.2541961
[56] Moein Pahlavan Yali. 2014. FPGA-Roofline: An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems. Ph.D. Dissertation. Virginia Polytechnic Institute and State University.
[57] Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings - International Symposium on High-Performance Computer Architecture. 382–393. https://doi.org/10.1109/HPCA.2011.5749745



                      [43] Soekris Engineering 2016 vpn 1401 for Std PCI-sockets Soekris Engineeringhttpsoekriscomproductsvpn-1401html

                      [44] Shuaiwen Song Chunyi Su Barry Rountree and Kirk W Cameron 2013 Asimplified and accurate model of power-performance efficiency on emergent GPUarchitectures In Proceedings - IEEE 27th International Parallel and DistributedProcessing Symposium IPDPS 2013 673ndash686 httpsdoiorg101109IPDPS201373

                      [45] Jeff Stuecheli 2013 POWER8 In 25th Hot Chip Symposium[46] Ning Sun and Chi-Chang Lin 2007 Using the Cryptographic Accelerators in the

                      UltraSPARC T1 and T2 processors Technical Report httpwwworaclecomtechnetworkserver-storagesolarisdocumentation819-5782-150147pdf

                      ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

                      [47] S Tabik G Ortega and E M Garzoacuten 2014 Performance evaluation of ker-nel fusion BLAS routines on the GPU iterative solvers as case study TheJournal of Supercomputing 70 2 (nov 2014) 577ndash587 httpsdoiorg101007s11227-014-1102-4

                      [48] Y C Tay 2013 Analytical Performance Modeling for Computer Systems (2nded) Morgan amp Claypool Publishers

                      [49] MB Taylor 2012 Is dark silicon useful harnessing the four horsemen of thecoming dark silicon apocalypse In Design Automation Conference (DAC) 201249th ACMEDACIEEE 1131ndash1136 httpsdoiorg10114522283602228567

                      [50] G Venkatesh J Sampson N Goulding S Garcia V Bryksin J Lugo-MartinezS Swanson and M B Taylor 2010 Conservation cores Reducing the energyof mature computations In International Conference on Architectural Supportfor Programming Languages and Operating Systems - ASPLOS 205ndash218 httpsdoiorg10114517360201736044

                      [51] Ganesh Venkatesh Jack Sampson Nathan Goulding-Hotta Sravanthi KotaVenkata Michael Bedford Taylor and Steven Swanson 2011 QsCores Trad-ing Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores InProceedings of the 44th Annual IEEEACM International Symposium on Microar-chitecture - MICRO-44 rsquo11 163 httpsdoiorg10114521556202155640

                      [52] Guibin Wang Yisong Lin and Wei Yi 2010 Kernel Fusion An EffectiveMethod for Better Power Efficiency on Multithreaded GPU In Green Computingand Communications (GreenCom) 2010 IEEEACM Intrsquol Conference on CyberPhysical and Social Computing (CPSCom) 344ndash350 httpsdoiorg101109GreenCom-CPSCom2010102

                      [53] Eric W Weisstein 2015 Newtonrsquos Method From MathWorld ndash A Wolfram WebResource httpmathworldwolframcomNewtonsMethodhtml

                      [54] Samuel Williams Andrew Waterman and David Patterson 2009 Roofline aninsightful visual performance model for multicore architectures Commun ACM52 (2009) 65ndash76 httpsdoiorg10114514987651498785

                      [55] Lisa Wu Andrea Lottarini Timothy K Paine Martha A Kim and Kenneth ARoss 2014 Q100 The Architecture and Design of a Database Processing UnitIn Proceedings of the 19th International Conference on Architectural Supportfor Programming Languages and Operating Systems (ASPLOS rsquo14) ACM NewYork NY USA 255ndash268 httpsdoiorg10114525419402541961

                      [56] Moein Pahlavan Yali 2014 FPGA-Roofline An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems PhD DissertationVirginia Polytechnic Institute and State University

                      [57] Yao Zhang and John D Owens 2011 A quantitative performance analysismodel for GPU architectures In Proceedings - International Symposium on High-Performance Computer Architecture 382ndash393 httpsdoiorg101109HPCA20115749745

                      • Abstract
                      • 1 Introduction
                      • 2 The LogCA Model
                        • 21 Effect of Granularity
                        • 22 Performance Metrics
                        • 23 Granularity dependent latency
                          • 3 Applications of LogCA
                            • 31 Performance Bounds
                            • 32 Sensitivity Analysis
                              • 4 Experimental Methodology
                              • 5 Evaluation
                                • 51 Linear-Complexity Kernels (= 1)
                                • 52 Super-Linear Complexity Kernels (gt 1)
                                • 53 Sub-Linear Complexity Kernels (lt 1)
                                • 54 Case Studies
                                  • 6 Related Work
                                  • 7 Conclusion and Future Work
                                  • References

                        ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

reduces to 512KB. Although data copying over the PCIe bus is eliminated, the overheads are still high, suggesting high-overhead OS calls to access the APU. With less aggressive cores, the cut-off granularity for acceleration reduces to 64KB, implying more potential for optimizing acceleration.

Figure 11 (c) shows three optimization regions for the HSA-enabled integrated GPU. We observe that latency is absent in all regions, and the cut-off granularity for overhead reduces to 8KB. These reductions in overheads and latencies signify a simpler interface as compared to the discrete GPU and APU. We also observe that the cut-off granularity for acceleration drops to 2KB, suggesting higher potential for optimizing acceleration.
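These cut-off granularities fall directly out of the model. As a minimal illustration, the Python sketch below, which is not from the paper's artifact and uses invented parameter values, computes the break-even granularity (the point where speedup reaches 1) from LogCA's time equations in their granularity-independent-latency form: host time C*g^beta and accelerated time o + L + C*g^beta / A.

```python
# Minimal sketch of LogCA's break-even granularity, assuming the
# granularity-independent-latency form of the model.
# o, L are in host cycles; C is host cycles per byte; A is peak acceleration.
# All parameter values below are made up for illustration, not measured.

def speedup(g, o, L, C, A, beta=1.0):
    """Host time C*g**beta divided by accelerated time o + L + C*g**beta / A."""
    t_host = C * g**beta
    return t_host / (o + L + t_host / A)

def g1(o, L, C, A):
    """Break-even granularity for a linear-complexity kernel (beta = 1):
    C*g = o + L + C*g/A  =>  g1 = A*(o + L) / (C*(A - 1))."""
    return A * (o + L) / (C * (A - 1))

o, L, C, A = 10_000.0, 2_000.0, 0.5, 50.0   # hypothetical interface costs
g_be = g1(o, L, C, A)
print(f"break-even granularity: {g_be:.0f} bytes")
print(f"speedup at 16x break-even: {speedup(16 * g_be, o, L, C, A):.1f}")
```

Because g1 grows with o + L, a simpler interface (lower overheads and latencies) directly shrinks the cut-off granularity, which is the trend observed across the discrete GPU, the APU, and the HSA-enabled integrated GPU above.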

6 RELATED WORK
We compare and contrast our work with prior approaches. Lopez-Novoa et al. [28] provide a detailed survey of various accelerator modeling techniques. We broadly classify these techniques into two categories and discuss the most relevant work.

Analytical Models. There is a rich body of work exploring analytical models for performance prediction of accelerators. For some models, the motivation is to determine the future trend in heterogeneous architectures. Chung et al. [7], in a detailed study, predict the future landscape of heterogeneous computing. Hempstead et al. [17] propose Navigo, an early-stage model that determines the fraction of area required for accelerators to maintain the traditional performance trend. Nilakantan et al. [32] propose incorporating communication cost in early-stage models of accelerator-rich architectures. For others, the motivation is to determine the right amount of data to offload and the potential benefits associated with an accelerator [24].

Some analytical models are architecture specific. For example, a number of studies [20, 21, 44, 57] predict the performance of GPU architectures. Hong et al. [20] present an analytical performance model for predicting execution time on GPUs. They later extend their model and develop an integrated power and performance model for GPUs [21]. Song et al. [44] use a simple counter-based approach to predict power and performance. Meswani et al. [29] explore such models for high performance applications. Daga et al. [11] analyze the effectiveness of accelerated processing units (APUs) over GPUs and describe the communication cost over the PCIe bus as a major bottleneck in exploiting the full potential of GPUs.

In general, our work differs from these studies in its complexity. These models use a large number of parameters to accurately predict power and/or performance, whereas we limit the number of parameters to reduce the complexity of our model. They also require a deep understanding of the underlying architecture, and most of them require access to GPU-specific assembly or PTX code. Unlike these approaches, we use CPU code to provide bounds on the performance.

Roofline Models. In terms of simplicity and motivation, our work closely matches the Roofline model [54], a visual performance model for multi-core architectures. Roofline exposes bottlenecks for a kernel and suggests several optimizations which programmers can use to fine-tune the kernel on a given system.
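For context, the Roofline bound itself reduces to a single min() over peak compute and bandwidth-limited throughput. The following sketch uses illustrative, not measured, machine parameters:

```python
# Sketch of the classic Roofline bound [54]: attainable throughput is
# capped by either peak compute or memory bandwidth times intensity.
# The 1 TFLOP/s and 100 GB/s figures below are illustrative only.

def roofline_gflops(oi, peak_gflops, peak_bw_gbs):
    """Attainable GFLOP/s = min(peak compute, bandwidth * operational intensity)."""
    return min(peak_gflops, peak_bw_gbs * oi)

for oi in (0.25, 1.0, 4.0, 16.0):           # operational intensity, FLOPs per byte
    print(oi, roofline_gflops(oi, peak_gflops=1000.0, peak_bw_gbs=100.0))
```

LogCA asks a different question: rather than bounding the throughput of a tuned kernel, it bounds the speedup of offloading a kernel as a function of granularity and interface costs.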

A number of extensions of Roofline have been proposed [10, 22, 33, 56], and some of these extensions are architecture specific, for example, targeting GPUs [22], vector processors [39], and FPGAs [10, 56].

Despite the similarities, Roofline and its extensions cannot be used for exposing design bottlenecks in an accelerator's interface. The primary goal of Roofline models has been to help programmers and compiler writers, while LogCA provides more insights for architects.

7 CONCLUSION AND FUTURE WORK
With the recent trend towards heterogeneous computing, we feel that the architecture community lacks a model to reason about the need for accelerators. In this respect, we propose LogCA, an insightful visual performance model for hardware accelerators. LogCA provides insights early in the design stage to both architects and programmers: it identifies performance bounds, exposes interface design bottlenecks, and suggests optimizations to alleviate these bottlenecks. We have validated our model across a range of on-chip and off-chip accelerators and have shown its utility using retrospective studies describing the evolution of the accelerator interface in these architectures.

The applicability of LogCA can be limited by our simplifying assumptions, and for more realistic analysis we plan to overcome these limitations in future work. For example, we assume a single-accelerator system and do not explicitly model contention among resources; our model should handle multi-accelerator and pipelined scenarios. For fixed-function accelerators, our design space is currently limited to encryption and hashing kernels. To overcome this, we are expanding our design space with the compression and database accelerators in the Oracle M7 processor. We also plan to complement LogCA with an energy model, as energy efficiency is a prime design metric for accelerators.

ACKNOWLEDGEMENTS
We thank our anonymous reviewers, Arkaprava Basu, and Tony Nowatzki for their insightful comments and feedback on the paper. Thanks to Mark Hill, Michael Swift, the Wisconsin Computer Architecture Affiliates, and other members of the Multifacet group for their valuable discussions. We also thank Brian Wilson at University of Wisconsin DoIT and Eric Sedlar at Oracle Labs for providing access to SPARC T3 and T4 servers, respectively. Also thanks to Muhammad Umair Bin Altaf for his help in the formulation. This work is supported in part by the National Science Foundation (CNS-1302260, CCF-1438992, CCF-1533885, CCF-1617824), Google, and the University of Wisconsin-Madison (Amar and Balindar Sohi Professorship in Computer Science). Wood has a significant financial interest in AMD and Google.

REFERENCES
[1] Advanced Micro Devices. 2016. APP SDK - A Complete Development Platform. Advanced Micro Devices. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk
[2] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS '67 (Spring)). 483. https://doi.org/10.1145/1465482.1465560
[3] Dan Anderson. 2012. How to tell if SPARC T4 crypto is being used. https://blogs.oracle.com/DanX/entry/how_to_tell_if_sparc
[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. 2009. A View of the Parallel Computing Landscape. Commun. ACM 52, 10 (Oct. 2009), 56-67. https://doi.org/10.1145/1562764.1562783
[5] Nathan Beckmann and Daniel Sanchez. 2016. Cache Calculus: Modeling Caches through Differential Equations. Computer Architecture Letters PP, 99 (2016), 1. https://doi.org/10.1109/LCA.2015.2512873
[6] C. Cascaval, S. Chatterjee, H. Franke, K. J. Gildea, and P. Pattnaik. 2010. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development 54 (2010), 5:1-5:10. https://doi.org/10.1147/JRD.2010.2059721
[7] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. 2010. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (Dec. 2010). 225-236. https://doi.org/10.1109/MICRO.2010.36
[8] Jason Cong, Zhenman Fang, Michael Gill, and Glenn Reinman. 2015. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. In 2015 IEEE/ACM International Conference on Computer-Aided Design, Austin, TX.
[9] D. Culler, R. Karp, D. Patterson, and A. Sahay. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 1-12. http://dl.acm.org/citation.cfm?id=155333
[10] Bruno da Silva, An Braeken, Erik H. D'Hollander, and Abdellah Touhafi. 2013. Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools. Int. J. Reconfig. Comput. 2013 (Jan. 2013). https://doi.org/10.1155/2013/428078
[11] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. 2011. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing. IEEE, 141-149. https://doi.org/10.1109/SAAHPC.2011.29
[12] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA '11). ACM Press, New York, NY, USA, 365. https://doi.org/10.1145/2000064.2000108
[13] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. 2010. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development 54 (2010), 3:1-3:11. https://doi.org/10.1147/JRD.2009.2036980
[14] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings - International Symposium on High-Performance Computer Architecture. 503-514. https://doi.org/10.1109/HPCA.2011.5749755
[15] Shay Gueron. 2012. Intel Advanced Encryption Standard (AES) Instructions Set. Technical Report. Intel Corporation. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[16] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). 37. https://doi.org/10.1145/1815961.1815968
[17] Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. Navigo: An early-stage model to study power-constrained architectures and specialization. In ISCA Workshop on Modeling, Benchmarking, and Simulations (MoBS).
[18] John L. Hennessy and David A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach. 704 pages.
[19] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (July 2008), 33-38. https://doi.org/10.1109/MC.2008.209
[20] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, Vol. 37. 152-163. https://doi.org/10.1145/1555815.1555775
[21] Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, Vol. 38. 280-289. https://doi.org/10.1145/1816038.1815998
[22] Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par '12). Springer-Verlag, Berlin, Heidelberg, 920-932. https://doi.org/10.1007/978-3-642-32820-6_90
[23] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 468-479. https://doi.org/10.1145/2540708.2540748
[24] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. 2013. A survey of computation offloading for mobile systems. Mobile Networks and Applications 18 (2013), 129-140. https://doi.org/10.1007/s11036-012-0368-0
[25] Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 361-372. https://doi.org/10.1145/2751205.2751231
[26] Maysam Lavasani, Hari Angepat, and Derek Chiou. 2014. An FPGA-based In-Line Accelerator for Memcached. IEEE Comput. Archit. Lett. 13, 2 (July 2014), 57-60. https://doi.org/10.1109/L-CA.2013.17
[27] John D. C. Little and Stephen C. Graves. 2008. Little's law. In Building Intuition. Springer, 81-100.
[28] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso. 2015. A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. IEEE Transactions on Parallel and Distributed Systems 26, 1 (Jan. 2015), 272-281. https://doi.org/10.1109/TPDS.2014.2308216
[29] M. R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole. 2013. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications 27 (2013), 89-108. https://doi.org/10.1177/1094342012468180
[30] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). National Institute of Standards and Technology. https://doi.org/10.6028/NIST.FIPS.197
[31] National Institute of Standards and Technology. 2008. Secure Hash Standard. National Institute of Standards and Technology. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
[32] S. Nilakantan, S. Battle, and M. Hempstead. 2013. Metrics for Early-Stage Modeling of Many-Accelerator Architectures. Computer Architecture Letters 12, 1 (Jan. 2013), 25-28. https://doi.org/10.1109/L-CA.2012.9
[33] Cedric Nugteren and Henk Corporaal. 2012. The Boat Hull Model: Enabling Performance Prediction for Parallel Computing Prior to Code Development. In Proceedings of the 9th Conference on Computing Frontiers. ACM, 203-212.
[34] OpenSSL Software Foundation. 2015. OpenSSL: Cryptography and SSL/TLS Toolkit. OpenSSL Software Foundation. https://openssl.org
[35] Sanjay Patel. 2009. Sun's Next-Generation Multithreaded Processor: Rainbow Falls. In 21st Hot Chips Symposium. http://www.hotchips.org/wp-content/uploads/hc
[36] Sanjay Patel and Wen-mei W. Hwu. 2008. Accelerator Architectures. IEEE Micro 28, 4 (July 2008), 4-12. https://doi.org/10.1109/MM.2008.50
[37] Stephen Phillips. 2014. M7: Next Generation SPARC. In 26th Hot Chips Symposium.
[38] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.
[39] Yoshiei Sato, Ryuichi Nagaoka, Akihiro Musa, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Hiroaki Kobayashi. 2009. Performance tuning and analysis of future vector processors based on the roofline model. In Proceedings of the 10th MEDEA Workshop on MEmory performance: DEaling with Applications, systems and architecture (MEDEA '09). 7. https://doi.org/10.1145/1621960.1621962
[40] M. Shah, J. Barreh, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. 2007. UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC. In Solid-State Circuits Conference, 2007 (ASSCC '07), IEEE Asian. 22-25. https://doi.org/10.1109/ASSCC.2007.4425786
[41] Manish Shah, Robert Golla, Gregory Grohoski, Paul Jordan, Jama Barreh, Jeffrey Brooks, Mark Greenberg, Gideon Levinsky, Mark Luttrell, Christopher Olson, Zeid Samoail, Matt Smittle, and Thomas Ziaja. 2012. Sparc T4: A dynamically threaded server-on-a-chip. IEEE Micro 32 (2012), 8-19. https://doi.org/10.1109/MM.2012.1
[42] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In International Symposium on Computer Architecture (ISCA).
[43] Soekris Engineering. 2016. vpn1401 for Std. PCI-sockets. Soekris Engineering. http://soekris.com/products/vpn-1401.html
[44] Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS 2013). 673-686. https://doi.org/10.1109/IPDPS.2013.73
[45] Jeff Stuecheli. 2013. POWER8. In 25th Hot Chips Symposium.
[46] Ning Sun and Chi-Chang Lin. 2007. Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors. Technical Report. http://www.oracle.com/technetwork/server-storage/solaris/documentation/819-5782-150147.pdf
[47] S. Tabik, G. Ortega, and E. M. Garzón. 2014. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. The Journal of Supercomputing 70, 2 (Nov. 2014), 577-587. https://doi.org/10.1007/s11227-014-1102-4
[48] Y. C. Tay. 2013. Analytical Performance Modeling for Computer Systems (2nd ed.). Morgan & Claypool Publishers.
[49] M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE. 1131-1136. https://doi.org/10.1145/2228360.2228567
[50] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 205-218. https://doi.org/10.1145/1736020.1736044
[51] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44). 163. https://doi.org/10.1145/2155620.2155640
[52] Guibin Wang, Yisong Lin, and Wei Yi. 2010. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on Cyber, Physical and Social Computing (CPSCom). 344-350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102
[53] Eric W. Weisstein. 2015. Newton's Method. From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html
[54] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52 (2009), 65-76. https://doi.org/10.1145/1498765.1498785
[55] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 255-268. https://doi.org/10.1145/2541940.2541961
[56] Moein Pahlavan Yali. 2014. FPGA-Roofline: An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems. Ph.D. Dissertation. Virginia Polytechnic Institute and State University.
[57] Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings - International Symposium on High-Performance Computer Architecture. 382-393. https://doi.org/10.1109/HPCA.2011.5749745
