

A CASE FOR SPECIALIZED PROCESSORS FOR SCALE-OUT WORKLOADS


Emerging scale-out workloads need extensive computational resources, but datacenters using modern server hardware face physical constraints. In this article, the authors show that modern server processors are highly inefficient for running cloud workloads. They investigate the microarchitectural behavior of scale-out workloads and present opportunities to enable specialized processor designs that closely match the needs of the cloud.

Cloud computing is emerging as a dominant computing platform for delivering scalable online services to a global client base. Today’s popular online services, such as web search, social networks, and video sharing, are all hosted in large scale-out datacenters. With the industry rapidly expanding, service providers are building new datacenters, augmenting the existing infrastructure to meet the increasing demand. However, while demand for cloud infrastructure continues to grow, the semiconductor manufacturing industry has reached the physical limits of voltage scaling,1,2 no longer able to reduce power consumption or increase power density in new chips. Physical constraints have therefore become the dominant limiting factor, because the size and power demands of larger datacenters cannot be met.

Although major design changes are being introduced at the board and chassis levels of new cloud servers, the processors used in modern servers were originally created for desktops and are not designed to efficiently run scale-out workloads. Processor vendors use the same underlying microarchitecture for servers and for the general-purpose market, leading to extreme inefficiency in today’s datacenters. Moreover, both general-purpose and traditional server processor designs follow a trajectory that benefits scale-up workloads, a trend that was established for desktop processors long before the emergence of scale-out workloads.

In this article, based on our paper for the 17th International Conference on Architectural Support for Programming Languages and Operating Systems,3 we observe that scale-out workloads share many inherent characteristics that place them into a workload class distinct from desktop, parallel, and traditional server workloads. We perform a detailed microarchitectural study of a range of scale-out workloads, finding a large mismatch between the demands of the scale-out workloads and today’s predominant processor microarchitecture. We observe significant overprovisioning of the memory hierarchy and core microarchitectural resources for the scale-out workloads.

Michael Ferdman, Stony Brook University
Almutaz Adileh, Ghent University
Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi, École Polytechnique Fédérale de Lausanne

0272-1732/14/$31.00 © 2014 IEEE. Published by the IEEE Computer Society.

We use performance counters to study the behavior of scale-out workloads running on


modern server processors. On the basis of our analysis, we demonstrate the following:

• Scale-out workloads suffer from high instruction-cache miss rates. Instruction caches and associated next-line prefetchers found in modern processors are inadequate for scale-out workloads.

• Instruction-level parallelism (ILP) and memory-level parallelism (MLP) in scale-out workloads are low. Modern aggressive out-of-order cores are excessively complex, consuming power and on-chip area without providing performance benefits to scale-out workloads.

• Data working sets of scale-out workloads considerably exceed the capacity of on-chip caches. Processor real estate and power are misspent on large last-level caches that do not contribute to improved scale-out workload performance.

• On- and off-chip bandwidth requirements of scale-out workloads are low. Scale-out workloads see no benefit from fine-grained coherence and excessive memory and core-to-core communication bandwidth.

Continuing the current processor trends will further widen the mismatch between scale-out workloads and server processors. Conversely, the characteristics of scale-out workloads can be effectively leveraged to specialize processors for these workloads in order to gain area and energy efficiency in future servers. An example of such a specialized processor design that matches the needs of scale-out workloads is Scale-Out Processor,4 which has been shown to improve the system throughput and the overall datacenter cost efficiency by almost an order of magnitude.5

Modern cores and scale-out workloads

Today’s datacenters are built around conventional desktop processors whose architecture was designed for a broad market. The dominant processor architecture has closely followed the technology trends, improving single-thread performance with each processor generation by using the increased clock

speeds and “free” (in area and power) transistors provided by progress in semiconductor manufacturing. Although Dennard scaling has stopped,1,2,6,7 with both clock frequency and transistor counts becoming limited by power, processor architects have continued to spend resources on improving single-thread performance for a broad range of applications at the expense of area and power efficiency.

In this article, we study a set of applications that dominate today’s cloud infrastructure. We examined a selection of Internet services on the basis of their popularity. For each popular service, we analyzed the class of application software used by major providers to offer these services, either on their own cloud infrastructure or on a cloud infrastructure leased from a third party. Overall, we found that scale-out workloads have similar characteristics. All applications we examined

• operate on large data sets that are distributed across a large number of machines, typically into memory-resident shards;

• serve large numbers of completely independent requests that do not share any state;

• have application software designed specifically for the cloud infrastructure, where unreliable machines may come and go; and

• use connectivity only for high-level task management and coordination.
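These characteristics can be made concrete with a toy sketch (entirely illustrative; the class and the modulo-hash placement below are our own, not taken from any of the studied systems): data lives in memory-resident shards, and each request is independent and touches a single shard.

```python
# Illustrative sketch of a sharded, memory-resident store serving
# independent requests. Names and placement policy are hypothetical.

class ShardedStore:
    def __init__(self, num_machines):
        # Each "machine" holds one in-memory shard of the dataset.
        self.shards = [dict() for _ in range(num_machines)]

    def _shard_for(self, key):
        # Simple hash placement; production systems typically use
        # consistent hashing so unreliable machines can come and go.
        return hash(key) % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key):
        # Requests share no state: each lookup touches one shard only.
        return self.shards[self._shard_for(key)].get(key)

store = ShardedStore(num_machines=3)
store.put("user:42", {"name": "alice"})
assert store.get("user:42") == {"name": "alice"}
```

Because requests share no state, capacity grows by adding machines, and connectivity is needed only for placement and coordination, matching the characteristics listed above.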

Specifically, we identified and studied the following workloads: an in-memory object cache (Data Caching); a NoSQL persistent data store (Data Serving); data filtering, transformation, and analysis (MapReduce); a video-streaming service (Media Streaming); a large-scale irregular engineering computation (SAT Solver); a dynamic Web 2.0 service (Web Frontend); and an online search engine node (Web Search). To highlight the differences between scale-out workloads and traditional workloads, we evaluated cloud workloads alongside the following traditional benchmark suites: Parsec 2.1 parallel workloads, SPEC CPU2006 desktop and engineering workloads, SPECweb09 traditional web services, TPC-C traditional transaction processing workload, TPC-E modern transaction


TOP PICKS


32 IEEE MICRO


processing workload, and MySQL Web 2.0 back-end database.

Methodology

We conducted our study on a PowerEdge M1000e enclosure with two Intel X5670 processors and 24 Gbytes of RAM in each blade, using Intel VTune to analyze the system’s microarchitectural behavior. Each Intel X5670 processor includes six aggressive out-of-order processor cores with a three-level cache hierarchy: the L1 and L2 caches are private to each core; the last-level cache (LLC)—the L3 cache—is shared among all cores. Each core includes several simple stride and stream prefetchers, labeled as “adjacent-line,” “HW prefetcher,” and “DCU streamer” in the processor documentation and system BIOS settings. The blades use high-performance Broadcom server network interface controllers (NICs) with drivers that support multiple transmit queues and receive-side scaling. The NICs are connected by a built-in M6220 switch. For bandwidth-intensive benchmarks, 2-Gbit NICs are used in each blade.

Table 1 summarizes the blades’ key architectural parameters. We limited all workload configurations to four cores, tuning the workloads to achieve high utilization of the cores (or hardware threads, in the case of the SMT experiments), while maintaining the workload quality-of-service requirements. To ensure that all application and operating

system software runs on the cores under test, we disabled all unused cores using the available operating system mechanisms.
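As a hedged sketch of those operating system mechanisms: on Linux, a core is taken offline by writing 0 to its sysfs online file. The helper below only computes the writes (performing them requires root), and the core numbering is illustrative, not the article’s configuration:

```python
def offline_writes(cores_to_disable):
    """Return the (path, value) pairs that would take the given cores
    offline via Linux sysfs CPU hotplug. Returned as data rather than
    performed, so the sketch is testable without root privileges."""
    return [(f"/sys/devices/system/cpu/cpu{n}/online", "0")
            for n in sorted(cores_to_disable)]

# Example: on a six-core chip, keep cores 0-3 for the workload under test.
writes = offline_writes({4, 5})
```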

Results

We explore the microarchitectural behavior of scale-out workloads by examining the commit-time execution breakdown in Figure 1. We classify each cycle of execution as Committing if at least one instruction was committed during that cycle, or as Stalled otherwise. We note that computing a breakdown of the execution-time stall components of superscalar out-of-order processors cannot be performed precisely because of overlapped work in the pipeline. We therefore present execution-time breakdown results based on the performance counters that have no overlap. Alongside the breakdown, we show the Memory cycles, which approximate time spent on long-latency memory accesses, but potentially partially overlap with instruction commits.
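The classification reduces to simple counter arithmetic. The sketch below is illustrative (the counter names are ours, not VTune’s event names), and Memory cycles are reported separately because they can overlap with commits:

```python
def breakdown(total_cycles, committing_cycles, memory_cycles):
    """Split execution cycles into Committing (at least one instruction
    retired that cycle) and Stalled (no instruction retired), reporting
    Memory cycles alongside since they may overlap with commits."""
    stalled_cycles = total_cycles - committing_cycles
    return {
        "committing_pct": 100.0 * committing_cycles / total_cycles,
        "stalled_pct": 100.0 * stalled_cycles / total_cycles,
        "memory_pct": 100.0 * memory_cycles / total_cycles,  # not disjoint
    }

# Hypothetical counter readings, not measurements from the article:
b = breakdown(total_cycles=1_000_000, committing_cycles=350_000,
              memory_cycles=500_000)
assert b["stalled_pct"] == 65.0
```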

The execution-time breakdown of scale-out workloads is dominated by stalls in both the application code and operating system. Notably, most of the stalls in scale-out workloads arise because of long-latency memory accesses. This behavior is in contrast to the CPU-intensive desktop (SPEC2006) and parallel (Parsec) benchmarks, which stall execution significantly less than 50 percent of the cycles and experience only a fraction

Table 1. Architectural parameters.

Component                          Details
Processor                          32-nm Intel Xeon X5670, operating at 2.93 GHz
Chip multiprocessor width          Six out-of-order cores
Core width                         Four-wide issue and retire
Reorder buffer                     128 entries
Load-store queue                   48/32 entries
Reservation stations               36 entries
Level-1 caches                     32 Kbytes instruction and 32 Kbytes data, four-cycle access latency
Level-2 cache                      256 Kbytes per core, six-cycle access latency
Last-level cache (Level-3 cache)   12 Mbytes, 29-cycle access latency
Memory                             24 Gbytes, three double-data-rate three (DDR3) channels, delivering up to 32 Gbytes/second


MAY/JUNE 2014 33


of the stalls due to memory accesses. Furthermore, although the execution-time breakdown of some scale-out workloads (such as MapReduce and SAT Solver) appears similar to the memory-intensive Parsec and SPEC2006 benchmarks, the nature of these workloads’ stalls is different. Unlike the scale-out workloads, many Parsec and SPEC2006 applications frequently stall because of pipeline flushes after wrong-path instructions, with much of the memory access time not on the critical path of execution.

Scale-out workloads show memory system behavior that more closely matches traditional online transaction processing workloads (TPC-C, TPC-E, and Web Backend). However, we observe that scale-out workloads differ considerably from traditional online transaction processing (TPC-C), which spends more than 80 percent of the time stalled, owing to dependent memory accesses. We find that scale-out workloads are most similar to the more recent transaction processing benchmarks (TPC-E) that use more complex data schemas or perform more complex queries than traditional transaction processing. We also observe that a traditional enterprise web workload (SPECweb09) behaves differently than the Web Frontend workload, representative of modern scale-out configurations. Although the traditional web workload is dominated by serving static files and a few dynamic scripts, modern scalable web

workloads like Web Frontend handle a much higher fraction of dynamic requests, leading to higher core utilization and less OS involvement.

Although the behavior across scale-out workloads is similar, the class of scale-out workloads as a whole differs significantly from other workloads. Processor architectures optimized for desktop and parallel applications are not optimized for scale-out workloads that spend most of their time waiting for cache misses, resulting in a clear microarchitectural mismatch. At the same time, architectures designed for workloads that perform only trivial computation and spend all of their time waiting on memory (such as SPECweb09 and TPC-C) also cannot cater to scale-out workloads.

Front-end inefficiencies

There are three major front-end inefficiencies:

• Cores are idle because of high instruction-cache miss rates.

• L2 caches increase average instruction-fetch latency.

• Excessive LLC capacity leads to long instruction-fetch latency.

Instruction-fetch stalls play a critical role in processor performance by preventing the core from making forward progress because of a lack of instructions to execute. Front-end

[Figure 1 shows stacked bars of total execution cycles per workload, broken into Stalled (OS), Stalled (Application), Committing (Application), and Committing (OS), with Memory cycles shown alongside.]

Figure 1. Execution-time breakdown and memory cycles of scale-out workloads (left) and traditional benchmarks (right). Execution time is further broken down into its application and operating system components.


stalls serve as a fundamental source of inefficiency for both area and power, because the core real estate and power consumption are entirely wasted for the cycles that the front end spends fetching instructions.

Figure 2 presents the instruction miss rates of the L1 instruction cache and the L2 cache. In contrast to desktop and parallel benchmarks, the instruction working sets of many scale-out workloads considerably exceed the capacity of the L1 instruction cache, resembling the instruction-cache behavior of traditional server workloads. Moreover, the instruction working sets of most scale-out workloads also exceed the L2 cache capacity, where even relatively infrequent instruction misses incur considerable performance penalties. We find that modern processor architectures can’t tolerate the latency of the L1 instruction cache’s misses, avoiding front-end stalls only for applications whose entire instruction working set fits into the L1 cache. Furthermore, the high L2 instruction miss rates indicate that the L1 instruction cache suffers a significant capacity shortfall that can’t be mitigated by the addition of a modestly sized L2 cache.
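The metric in Figure 2, misses per k-instruction (MPKI), normalizes misses by instructions retired rather than by cache accesses, which makes workloads with different memory intensities directly comparable. A minimal sketch, with hypothetical counter values:

```python
def mpki(misses, instructions):
    # Misses per kilo-instruction: normalizes by work completed, not by
    # the number of cache accesses, so it is comparable across workloads.
    return 1000.0 * misses / instructions

# Hypothetical counter readings: 120K L1-I misses over 2M instructions.
assert mpki(misses=120_000, instructions=2_000_000) == 60.0
```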

The disparity between the needs of the scale-out workloads and the processor architecture is apparent in the instruction-fetch path. Although exposed instruction-fetch stalls serve as a key source of inefficiency under any circumstances, the instruction-fetch path of modern processors actually exacerbates the problem. The L2 cache experiences high instruction miss rates, increasing the average fetch latency of the missing fetch requests by placing an additional intermediate lookup structure on the path to retrieve instruction blocks from the LLC. Moreover, the entire instruction working set of any scale-out workload is considerably smaller than the LLC capacity. However, because the LLC is a large cache with a high uniform access latency, it contributes an unnecessarily large instruction-fetch penalty (29 cycles to access the 12-Mbyte cache).
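A simplified additive model using the Table 1 latencies (4-cycle L1, 6-cycle L2, 29-cycle LLC) shows how the intermediate L2 lookup and the slow LLC inflate average fetch latency; the miss ratios below are hypothetical, chosen only to illustrate the shape of the effect:

```python
def avg_fetch_latency(l1_miss_ratio, l2_miss_ratio, l1=4, l2=6, llc=29):
    """Expected instruction-fetch latency under a simplified model:
    every fetch pays the L1 latency; L1 misses additionally pay the L2
    lookup; L2 misses additionally pay the LLC access. Off-chip misses
    are ignored, since instruction working sets fit in the LLC."""
    return l1 + l1_miss_ratio * (l2 + l2_miss_ratio * llc)

# Hypothetical 10% L1-I and 50% L2 instruction miss ratios:
lat = avg_fetch_latency(l1_miss_ratio=0.10, l2_miss_ratio=0.50)
assert abs(lat - 6.05) < 1e-9
```

Even modest miss ratios add the L2 and LLC terms to every fetch on average, which is why flattening the fetch path pays off.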

To improve efficiency and reduce front-end stalls, processors built for scale-out workloads must bring instructions closer to the cores. Rather than relying on a deep hierarchy of caches, a partitioned organization that replicates instructions and makes them available close to the requesting cores8 is likely to considerably reduce front-end stalls. To effectively use the on-chip real estate, the system would need to share the partitioned instruction caches among multiple cores, striking a balance between the die area dedicated to replicating instruction blocks and the latency of accessing those blocks from the closest cores.

[Figure 2 shows instruction misses per k-instruction per workload, broken into L1-I (OS), L1-I (Application), L2 (OS), and L2 (Application); two bars exceed the plotted scale, with off-scale values of 146 and 117.]

Figure 2. L1 and L2 instruction miss rates for scale-out workloads (left) and traditional benchmarks (right). The miss rate is broken down into its application and operating system components.


Furthermore, although modern processors include next-line prefetchers, high instruction-cache miss rates and significant front-end stalls indicate that the prefetchers are ineffective for scale-out workloads. Scale-out workloads are written in high-level languages, use third-party libraries, and execute operating system code, exhibiting complex nonsequential access patterns that are not captured by simple next-line prefetchers. Including instruction prefetchers that predict these complex patterns is likely to improve overall processor efficiency by eliminating wasted cycles due to front-end stalls.

Core inefficiencies

There are two major core inefficiencies:

• Low ILP precludes effectively using the full core width.

• The reorder buffer (ROB) and the load-store queue (LSQ) are underutilized because of low MLP.

Modern processors execute instructions out of order to enable simultaneous execution of multiple independent instructions per cycle (IPC). Additionally, out-of-order execution elides stalls due to memory accesses by executing independent instructions that follow a memory reference while the long-latency cache access is in progress. Modern processors support up to 128-instruction windows, with the width of the processor dictating the number of instructions that

can simultaneously execute in one cycle. In addition to exploiting ILP, large instruction windows can exploit MLP by finding independent memory accesses within the instruction window and performing the memory accesses in parallel. The latency of LLC hits and off-chip memory accesses cannot be hidden by out-of-order execution; achieving high MLP is therefore key to achieving high core utilization by reducing the data access latency.

The processors we study use four-wide cores that can decode, issue, execute, and commit up to four instructions on each cycle. However, in practice, ILP is limited by dependencies. The Baseline bars in Figure 3a show the average number of instructions committed per cycle when running on an aggressive four-wide out-of-order core. Despite the abundant availability of core resources and functional units, scale-out workloads achieve a modest application IPC, typically in the range of 0.6 (Data Caching and Media Streaming) to 1.1 (Web Frontend). Although there exist workloads that can benefit from wide cores, with some CPU-intensive Parsec and SPEC2006 applications reaching an IPC of 2.0 (indicated by the range bars in the figure), using wide processors for scale-out applications does not yield a significant benefit.
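Dividing IPC by the machine width gives the fraction of commit slots actually used, which makes the underutilization concrete (a sketch using the IPC range quoted above):

```python
def width_utilization(ipc, width=4):
    # Fraction of a `width`-wide core's commit slots used per cycle.
    return ipc / width

# At the measured application IPC of 0.6 to 1.1, a four-wide core
# commits instructions in only about 15 to 28 percent of its slots.
assert abs(width_utilization(0.6) - 0.15) < 1e-12
assert abs(width_utilization(1.1) - 0.275) < 1e-12
```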

Modern processors have 32-entry or larger load-store queues, enabling many memory-

[Figure 3 shows, per workload, (a) application IPC on a 0-to-4 scale and (b) application MLP on a 0-to-8 scale, each with Baseline and SMT bars.]

Figure 3. The instructions per cycle (IPC) and memory-level parallelism (MLP) of a simultaneous multithreading (SMT) enabled core. Application IPC for systems with and without SMT out of a maximum IPC of 4 (a). MLP for systems with and without SMT (b). Range bars indicate the minimum and maximum of the corresponding group.


reference instructions in the 128-instruction window. However, just as instruction dependencies limit ILP, address dependencies limit MLP. The Baseline bars in Figure 3b present the MLP, ranging from 1.4 (Web Frontend) to 2.3 (SAT Solver) for the scale-out workloads. These results indicate that the memory accesses in scale-out workloads are replete with complex dependencies, limiting the MLP that can be found by modern aggressive processors. We again note that while desktop and parallel applications can use high-MLP support, with some Parsec and SPEC2006 applications having an MLP up to 5.0, support for high MLP is not useful for scale-out applications. However, we find that scale-out workloads generally exhibit higher MLP than traditional server workloads. Noting that such characteristics lend themselves well to multithreaded cores, we examine the IPC and MLP of an SMT-enabled core in Figure 3. As expected, the MLP found and exploited by the cores when two independent application threads run on each core concurrently nearly doubles compared to the system without SMT. Unlike traditional database server workloads that contain many inter-thread dependencies and locks, the independent nature of threads in scale-out workloads enables them to observe considerable performance benefits from SMT, with 39 to 69 percent improvements in IPC.
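MLP here can be understood as the average number of outstanding off-core misses over the cycles in which at least one miss is in flight; a minimal sketch over hypothetical per-cycle occupancy samples (the sampling scheme is ours, for illustration):

```python
def mlp(outstanding_per_cycle):
    """Average memory-level parallelism: the mean outstanding-miss count
    over only those cycles with at least one miss in flight."""
    busy = [n for n in outstanding_per_cycle if n > 0]
    return sum(busy) / len(busy) if busy else 0.0

# Hypothetical per-cycle occupancy of the miss-tracking hardware:
samples = [0, 0, 1, 2, 2, 3, 0, 1, 2, 0]
assert abs(mlp(samples) - 11 / 6) < 1e-12
```

Dependent accesses keep the occupancy near 1, while independent accesses (as with two SMT threads) raise it, which is the doubling effect described above.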

Support for four-wide out-of-order execution with a 128-instruction window and up to 48 outstanding memory requests requires multiple-branch prediction, numerous arithmetic logic units (ALUs), forwarding paths, many-ported register banks, a large instruction scheduler, highly associative ROB and LSQ, and many other complex on-chip structures. The complexity of the cores limits core count, leading to chip designs with several cores that consume half the available on-chip real estate and dissipate the vast majority of the chip’s dynamic power budget. However, our results indicate that scale-out workloads exhibit low ILP and MLP, deriving benefit only from a small degree of out-of-order execution. As a result, the nature of scale-out workloads cannot effectively utilize the available core resources. Both the die area and the energy are wasted, leading to datacenter inefficiency.

The nature of scale-out workloads makes them ideal candidates to exploit multithreaded multicore architectures. Modern mainstream processors offer excessively complex cores, resulting in inefficiency through resource waste. At the same time, our results indicate that niche processors offer excessively simple (for example, in-order) cores that cannot leverage the available ILP and MLP in scale-out workloads. We find that scale-out workloads match well with architectures offering multiple independent threads per core with a modest degree of superscalar out-of-order execution and support for several simultaneously outstanding memory accesses. For example, rather than implementing SMT on a four-way core, we could use two independent two-way cores, which would consume fewer resources while achieving higher aggregate performance. Furthermore, each narrower core would not require a large instruction window, reducing the per-core area and power consumption compared to modern processors and enabling higher computational density by integrating more cores per chip.

Data-access inefficiencies

There are two major data-access inefficiencies:

• Large LLC consumes area, but does not improve performance.

• Simple data prefetchers are ineffective.

More than half of commodity processor die area is dedicated to the memory system. Modern processors feature a three-level cache hierarchy, where the LLC is a large-capacity cache shared among all cores. To enable high-bandwidth data fetch, each core can have up to 16 L2 cache misses in flight. The high-bandwidth on-chip interconnect enables cache-coherent communication between the cores. To mitigate the capacity and latency gap between the L2 caches and the LLC, each L2 cache is equipped with prefetchers that can issue prefetch requests into the LLC and off-chip memory. Multiple DDR3 memory channels provide high-bandwidth access to off-chip memory.

The LLC is the largest on-chip structure; its cache capacity has been increasing with each processor generation, thanks to


semiconductor manufacturing improvements. We investigate the utility of growing the LLC capacity for scale-out workloads in Figure 4 through a cache sensitivity analysis by dedicating two cores to cache-polluting threads. The polluter threads traverse arrays of predetermined size in a pseudorandom sequence, ensuring that all accesses miss in the upper-level caches and reach the LLC. We use performance counters to confirm that

the polluter threads achieve nearly a 100 percent hit ratio in the LLC, effectively reducing the cache capacity available for the workload running on the remaining cores of the same processor.
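Such a polluter traversal can be sketched as a pointer chase over a random cyclic permutation: accesses are pseudorandom (defeating stride prefetchers) yet touch every element exactly once per lap. This is our own minimal reconstruction, not the authors’ code, and the array size is illustrative:

```python
import random

def make_chase(n, seed=0):
    """Build a random cyclic permutation: nxt[i] is the index visited
    after i, so a traversal is pseudorandom but covers all n elements
    exactly once per lap."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    nxt = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        nxt[a] = b
    return nxt

def chase(nxt, steps, start=0):
    # Each load's address depends on the previous load's value, so the
    # hardware cannot overlap or prefetch these accesses.
    i = start
    for _ in range(steps):
        i = nxt[i]
    return i

nxt = make_chase(1 << 10)
# A full lap of n steps returns to the start: the walk is one n-cycle.
assert chase(nxt, 1 << 10, start=0) == 0
```

With the array sized to exceed the upper-level caches but fit in the LLC, such a chase hits almost entirely in the LLC, which is the behavior the performance counters confirmed.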

We plot the average system performance of scale-out workloads as a function of the LLC capacity, normalized to a baseline system with a 12-Mbyte LLC. Unlike in the memory-intensive desktop applications (such as SPEC2006 mcf), we find minimal performance sensitivity to LLC size above 4 to 6 Mbytes in scale-out and traditional server workloads. The LLC captures the instruction working sets of scale-out workloads, which are less than 2 Mbytes. Beyond this point, small shared supporting structures may consume another 1 to 2 Mbytes. Because scale-out workloads operate on massive datasets and service a large number of concurrent requests, both the dataset and the per-client data are orders of magnitude larger than the available on-chip cache capacity. As a result, an LLC that captures the instruction working set and minor supporting data structures achieves nearly the same performance as an LLC with double or triple the capacity.

In addition to leveraging MLP to overlap demand requests from the processor core, modern processors use prefetching to speculatively increase MLP. Prefetching has been shown effective at reducing cache miss rates by predicting block addresses that will be referenced in the future and bringing these blocks into the cache prior to the processor's demand, thereby hiding the access latency. In Figure 5, we present the hit ratios of the L2 cache when all available prefetchers are enabled (Baseline), as well as the hit ratios after disabling the prefetchers. We observe a noticeable degradation of the L2 hit ratios of many desktop and parallel applications when the adjacent-line prefetcher and L2 hardware prefetcher are disabled. In contrast, only one of the scale-out workloads (MapReduce) significantly benefits from these prefetchers, with the majority of the workloads experiencing negligible changes in the cache hit rate. Moreover, similar to traditional server workloads (TPC-C), disabling the prefetchers results in an increase in the hit ratio for some scale-out workloads (Data Caching, Media Streaming, and SAT Solver). Finally, we note

[Figure 4. Performance sensitivity to the last-level cache (LLC) capacity. Relatively small average performance degradation due to reduced cache capacity is shown for the scale-out and server workloads, in contrast to some traditional applications (such as mcf). Axes: user IPC normalized to baseline versus cache size in Mbytes (4 to 11); series: Scale-out, Server, SPEC2006 (mcf).]

[Figure 5. L2 hit ratios of a system with enabled and disabled adjacent-line and HW prefetchers. Unlike for Parsec and SPEC2006 applications, minimal performance difference is observed for the scale-out and server workloads. Bars: L2 hit ratio (0 to 100 percent) for each scale-out workload and for the PARSEC, SPEC2006, SPECweb09, TPC-C, TPC-E, and Web Backend baselines; series: Baseline (all enabled), Adjacent-line disabled, HW prefetcher disabled.]

..............................................................................................................................................................................................

TOP PICKS

............................................................

38 IEEE MICRO


that the DCU streamer (not shown) provides no benefit to scale-out workloads, and in some cases marginally increases the L2 miss rate because it pollutes the cache with unnecessary blocks.
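A toy cache model makes the adjacent-line trade-off concrete: on a miss, the prefetcher also installs the other line of the aligned 128-byte pair, which helps streaming access patterns but merely displaces useful blocks for irregular ones. This deliberately simplified sketch (fully associative, LRU, one cache level; none of which matches real hardware) is illustrative only:

```python
from collections import OrderedDict

def hit_ratio(trace, capacity_lines, adjacent_line=False):
    """Fully associative LRU cache of 64-byte lines, optional adjacent-line prefetch."""
    cache = OrderedDict()  # line address -> None, kept in LRU order

    def install(line):
        cache[line] = None
        cache.move_to_end(line)
        if len(cache) > capacity_lines:
            cache.popitem(last=False)  # evict the least recently used line

    hits = 0
    for addr in trace:
        line = addr // 64
        if line in cache:
            hits += 1
            cache.move_to_end(line)
        else:
            install(line)
            if adjacent_line:
                install(line ^ 1)  # the other line of the aligned 128-byte pair
    return hits / len(trace)

# Sequential stream: adjacent-line prefetch converts half the misses to hits.
seq = list(range(0, 64 * 1024, 64))
assert hit_ratio(seq, 64) == 0.0
assert hit_ratio(seq, 64, adjacent_line=True) == 0.5
```

On a dependent, irregular trace the prefetched neighbors are rarely referenced, so they only halve the effective capacity, which is the mechanism behind the hit-ratio improvements we observe when the prefetchers are disabled.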

Our results show that the on-chip resources devoted to the LLC are one of the key limiters of scale-out application computational density in modern processors. For traditional workloads, increasing the LLC capacity captures the working set of a broader range of applications, contributing to improved performance, owing to a reduction in average memory latency for those applications. However, because the LLC capacity already exceeds the scale-out application requirements by 2 to 3 times, whereas the next working set exceeds any possible SRAM cache capacity, the majority of the die area and power currently dedicated to the LLC is wasted. Moreover, prior research has shown that increases in the LLC capacity that do not capture a working set lead to an overall performance degradation; LLC access latency is high due to its large capacity, not only wasting on-chip resources, but also penalizing all L2 cache misses by slowing down LLC hits and delaying off-chip accesses.

Although modern processors grossly overprovision the memory system, we can improve datacenter efficiency by matching the processor design to the needs of the scale-out workloads. Whereas modern processors dedicate approximately half of the die area to the LLC, scale-out workloads would likely benefit from a different balance. A two-level cache hierarchy with a modestly sized LLC that makes a special provision for caching instruction blocks would benefit performance. The reduced LLC capacity, along with the removal of the ineffective L2 cache, would offer access-latency benefits while also freeing up die area and power. The die area and power can be applied toward improving computational density and efficiency by adding more hardware contexts and more advanced prefetchers. Additional hardware contexts (more threads per core and more cores) should linearly increase application parallelism, and more advanced correlating data prefetchers could accurately prefetch complex data access patterns and increase the performance of all cores.

Bandwidth inefficiencies

The major bandwidth inefficiencies are

• Lack of data sharing deprecates coherence and connectivity.

• Off-chip bandwidth exceeds needs by an order of magnitude.

Increasing core counts have brought parallel programming into the mainstream, highlighting the need for fast and high-bandwidth inter-core communication. Multithreaded applications comprise a collection of threads that work in tandem to scale up the application performance. To enable effective scale-up, each subsequent generation of processors offers a larger core count and improves the on-chip connectivity to support faster and higher-bandwidth core-to-core communication.

We investigate the utility of the on-chip interconnect for scale-out workloads in Figure 6. To measure the frequency of read-write sharing, we execute the workloads on cores split across two physical processors in separate sockets. When reading a recently modified block, this configuration forces accesses to actively shared read-write blocks to appear as off-chip accesses to a remote processor cache. We plot the fraction of L2 misses that access data most recently written by another thread running on a remote core, breaking down each bar into Application and

[Figure 6. Percentage of LLC data references accessing cache blocks modified by a remote core. In scale-out workloads, the majority of the remotely accessed cache blocks are from the operating system code. Bars: read-write shared LLC hits normalized to LLC data references (0 to 10 percent scale; Data Caching reaches 23 percent) for each scale-out workload and for the PARSEC, SPECweb09, TPC-C, TPC-E, and Web Backend baselines; series: Application, OS.]



OS components to offer insight into the source of the data sharing.

In general, we observe limited read-write sharing across the scale-out applications. We find that the OS-level data sharing is dominated by the network subsystem, seen most prominently in the Data Caching workload, which spends the majority of its time in the OS. This observation highlights the need to optimize the OS to reduce the amount of false sharing and data movement in the scheduler and network-related data structures. Multithreaded Java-based applications (Data Serving and Web Search) exhibit a small degree of sharing due to the use of a parallel garbage collector that may run a collection thread on a remote core, artificially inducing application-level communication. Additionally, we found that the Media Streaming server updates global counters to track the total number of packets sent; reducing the amount of communication by keeping per-thread statistics is trivial and would eliminate the mutex lock and shared-object scalability bottleneck—an optimization that is already present in the Data Caching server we use. The on-chip application-level communication in scale-out workloads is distinctly different from traditional database server workloads (TPC-C, TPC-E, and Web Backend), which experience frequent interaction between threads on actively shared data structures that are used to service client requests.
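The per-thread statistics fix described above is a standard pattern: each thread increments private state, and the global total is computed only when the statistic is read, removing the shared counter and its lock from the packet-send path. A sketch of both designs follows; the class and method names are ours, not taken from the Media Streaming server's code:

```python
import threading

class SharedCounter:
    """One global counter; every increment serializes on a lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def add(self, n=1):
        with self._lock:
            self._value += n

    def read(self):
        with self._lock:
            return self._value

class PerThreadCounter:
    """Each thread updates a private slot; reads sum the slots lazily."""
    def __init__(self):
        self._local = threading.local()
        self._slots = []
        self._register = threading.Lock()  # taken once per thread, not per update

    def add(self, n=1):
        slot = getattr(self._local, "slot", None)
        if slot is None:
            slot = self._local.slot = [0]
            with self._register:
                self._slots.append(slot)
        slot[0] += n  # no shared cache line and no lock on the hot path

    def read(self):
        return sum(s[0] for s in self._slots)

counter = PerThreadCounter()
workers = [threading.Thread(target=lambda: [counter.add() for _ in range(1000)])
           for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(counter.read())  # prints: 4000
```

The read path tolerates slight staleness while a worker is mid-update, which is acceptable for monitoring statistics and is precisely what removes the inter-core communication.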

The low degree of active sharing indicates that the wide and low-latency interconnects available in modern processors are overprovisioned for scale-out workloads. Although the overhead with a small number of cores is limited, as the number of cores on chip increases, the area and energy overhead of enforcing coherence becomes significant. Likewise, the area overheads and power consumption of an overprovisioned high-bandwidth interconnect further increase processor inefficiency.

Beyond the on-chip interconnect, we also find off-chip bandwidth inefficiency. While the off-chip memory latency has improved slowly, off-chip bandwidth has been improving at a rapid pace. Over the course of two decades, memory bus speeds have increased from 66 MHz to dual-data-rate at over 1 GHz, raising the peak theoretical bandwidth from 544 Mbytes/second to 17 Gbytes/second per channel, with the latest server processors having four independent memory channels. In Figure 7, we plot the per-core off-chip bandwidth utilization of our workloads as a fraction of the available per-core off-chip bandwidth. Scale-out workloads experience nonnegligible off-chip miss rates, but the MLP of the applications is low, owing to the complex data structure dependencies. The combination of low MLP and the small number of hardware threads on the chip leads to low aggregate off-chip bandwidth utilization, even when all cores have outstanding off-chip memory accesses. Among the scale-out workloads we examine, Media Streaming is the only application that uses up to 15 percent of the available off-chip bandwidth. However, our applications are configured to stress the processor, so this figure represents worst-case behavior. Overall, modern processors are not able to utilize the available memory bandwidth, which is significantly overprovisioned for scale-out workloads.
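The quoted peak-bandwidth figures follow directly from bus width and transfer rate: an 8-byte (64-bit) channel moves the bus frequency times the transfers per cycle times 8 bytes each second. A sketch of the arithmetic, using round numbers that approximate the figures in the text:

```python
def peak_bandwidth(bus_mhz, bytes_per_transfer, transfers_per_cycle):
    """Peak theoretical bandwidth of one memory channel, in bytes/second."""
    return bus_mhz * 1e6 * transfers_per_cycle * bytes_per_transfer

# 64-bit single-data-rate SDRAM bus at 66 MHz:
sdr = peak_bandwidth(66, 8, 1)      # ~0.53e9, roughly the quoted 544 Mbytes/s
# 64-bit DDR3 bus at 1,066 MHz, two transfers per cycle (2,133 MT/s):
ddr3 = peak_bandwidth(1066, 8, 2)   # 17.056e9, the quoted ~17 Gbytes/s
print(f"{sdr / 1e9:.2f} and {ddr3 / 1e9:.1f} Gbytes/s per channel; "
      f"{4 * ddr3 / 1e9:.0f} Gbytes/s across four channels")
```

Comparing the four-channel total against the per-core bound set by MLP and memory latency shows why the measured utilization in Figure 7 stays so low.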

The on-chip interconnect and off-chip memory buses can be scaled back to improve processor efficiency. Because the scale-out workloads perform only infrequent communication via the network, there is typically no read-write sharing in the applications; processors can therefore be designed as a collection of core islands using a low-bandwidth

[Figure 7. Average off-chip memory bandwidth utilization as a percentage of available off-chip bandwidth. Even at peak system utilization, all workloads exercise only a small fraction of the available memory bandwidth. Bars: off-chip memory bandwidth utilization (0 to 20 percent scale) for each scale-out workload and for the PARSEC (cpu/mem), SPEC2006 (cpu/mem), SPECweb09, TPC-C, TPC-E, and Web Backend baselines; series: Application, OS.]



interconnect that does not enforce coherence between the islands, eliminating the power associated with the high-bandwidth interconnect as well as the power and area overheads of fine-grained coherence tracking.4 Off-chip memory buses can be optimized for scale-out workloads by scaling back unnecessary bandwidth for systems with an insufficient number of cores. Memory controllers consume a large fraction of the chip area, and memory buses are responsible for a large fraction of the system power. Reducing the number of memory channels and the power draw of the memory buses should improve scale-out workload efficiency without affecting application performance. However, instead of taking a step backward and scaling back the memory bandwidth to match the requirements and throughput of conventional processors, a more effective solution would be to increase the processor throughput through specialization and thus utilize the available bandwidth resources.4

The impending plateau of voltage levels and a continued increase in chip density are forcing efficiency to be the primary driver of future processor designs. Our analysis shows that efficiently executing scale-out workloads requires optimizing the instruction-fetch path for multi-megabyte instruction working sets; reducing the core aggressiveness and LLC capacity to free area and power resources in favor of more cores, each with more hardware threads; and scaling back the overprovisioned on-chip and off-chip bandwidth. We demonstrate that modern processors, built to accommodate a broad range of workloads, sacrifice efficiency, and that current processor trends serve to further exacerbate the problem. On the other hand, we outline steps that can be taken to specialize processors for the key workloads of the future, enabling efficient execution by closely aligning the processor microarchitecture with the microarchitectural needs of scale-out workloads. Following these steps can result in up to an order of magnitude improvement in throughput per processor chip, and in the overall datacenter efficiency.5 MICRO

Acknowledgments

We thank the reviewers and readers for their feedback and suggestions on all earlier versions of this work. We thank the PARSA lab for continual support and feedback, in particular Pejman Lotfi-Kamran and Javier Picorel for their assistance with the SPECweb09 and SAT Solver benchmarks. We thank the DSLab for their assistance with SAT Solver, and Aamer Jaleel and Carole-Jean Wu for their assistance with understanding the Intel prefetchers and configuration. We thank the EuroCloud project partners for advocating and inspiring the CloudSuite benchmark suite. This work was partially supported by EuroCloud, project no. 247779 of the European Commission 7th RTD Framework Programme, Information and Communication Technologies: Computing Systems.

....................................................................
References
1. M. Horowitz et al., "Scaling, Power, and the Future of CMOS," Proc. Electron Devices Meeting, 2005, pp. 7-15.
2. N. Hardavellas et al., "Toward Dark Silicon in Servers," IEEE Micro, vol. 31, no. 4, 2011, pp. 6-15.
3. M. Ferdman et al., "Clearing the Clouds: A Study of Emerging Scale-Out Workloads on Modern Hardware," Proc. 17th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2012, pp. 37-48.
4. P. Lotfi-Kamran et al., "Scale-Out Processors," Proc. 39th Int'l Symp. Computer Architecture, 2012, pp. 500-511.
5. B. Grot et al., "Optimizing Data-Center TCO with Scale-Out Processors," IEEE Micro, vol. 32, no. 5, 2012, pp. 52-63.
6. H. Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling," Proc. 38th Int'l Symp. Computer Architecture, 2011, pp. 365-376.
7. G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations," Proc. 15th Conf. Architectural Support for Programming Languages and Operating Systems, 2010, pp. 205-218.
8. N. Hardavellas et al., "Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches," Proc. 36th Int'l Symp. Computer Architecture, 2009, pp. 184-195.



Michael Ferdman is an assistant professor in the Department of Computer Science at Stony Brook University. His research focuses on computer architecture, particularly on server system design. Ferdman has a PhD in electrical and computer engineering from Carnegie Mellon University.

Almutaz Adileh is a PhD candidate in the Department of Computer Science at Ghent University. His research focuses on computer architecture, particularly on improving performance in power-limited chips. Adileh has an MSc in computer engineering from the University of Southern California.

Onur Kocberber is a PhD candidate in the School of Computer and Communication Sciences at École Polytechnique Fédérale de Lausanne. His research focuses on specialized architectures for server systems. Kocberber has an MSc in computer engineering from TOBB University of Economics and Technology.

Stavros Volos is a PhD candidate in the School of Computer and Communication Sciences at École Polytechnique Fédérale de Lausanne. His research focuses on computer architecture, particularly on memory systems for high-throughput and energy-aware computing. Volos has a Dipl-Ing in electrical and computer engineering from the National Technical University of Athens.

Mohammad Alisafaee performed the work for this article while he was a researcher in the School of Computer and Communication Sciences at École Polytechnique Fédérale de Lausanne. His research interests include multiprocessor cache coherence and memory system design for commercial workloads. Alisafaee has an MSc in electrical and computer engineering from the University of Tehran.

Djordje Jevdjic is a PhD candidate in the School of Computer and Communication Sciences at École Polytechnique Fédérale de Lausanne. His research focuses on high-performance memory systems for servers, including on-chip DRAM caches and 3D-die stacking, with an emphasis on locality and energy efficiency. Jevdjic has an MSc in electrical and computer engineering from the University of Belgrade.

Cansu Kaynak is a PhD candidate in the School of Computer and Communication Sciences at École Polytechnique Fédérale de Lausanne. Her research focuses on server systems, especially memory system design. Kaynak has a BSc in computer engineering from TOBB University of Economics and Technology.

Adrian Daniel Popescu is a PhD candidate in the School of Computer and Communication Sciences at École Polytechnique Fédérale de Lausanne. His research focuses on the intersection of database management systems with distributed systems, specifically query performance prediction. Popescu has an MSc in electrical and computer engineering from the University of Toronto.

Anastasia Ailamaki is a professor in the School of Computer and Communication Sciences at École Polytechnique Fédérale de Lausanne. Her research interests include optimizing database software for emerging hardware and I/O devices and automating database management to support scientific applications. Ailamaki has a PhD in computer science from the University of Wisconsin-Madison.

Babak Falsafi is a professor in the School of Computer and Communication Sciences at École Polytechnique Fédérale de Lausanne and the founding director of EcoCloud, an interdisciplinary research center targeting robust, economic, and environmentally friendly cloud technologies. Falsafi has a PhD in computer science from the University of Wisconsin-Madison.

Direct questions and comments about this article to Michael Ferdman, Stony Brook University, 1419 Computer Science, Stony Brook, NY 11794; [email protected].
