
MULTICORE COMPUTING

The MANYCORE Revolution: Will HPC LEAD or FOLLOW?

Rumors of the death of Moore’s Law are greatly exaggerated, according to a team of computer scientists from Lawrence Berkeley National Laboratory (LBNL) and the University of California (UC)–Berkeley. In their view, Gordon Moore’s observation that the amount of computing power packed onto a chip doubles about every 18 months while the cost remains flat is alive and well. But the physics is changing.

Industry clung to the single-core model for as long as possible, arguably over-engineering the cores to eke out a few percentage points of increased performance. But the complex core designs of the past required enormous area, power, and complexity to maximize serial performance. Now, heat density and power constraints have all but ended 15 years of exponential clock frequency growth. The industry has responded by halting clock rate improvements and increases in core sophistication, and is instead doubling the number of cores every 18 months. The incremental path towards “multicore” chips (two, four, or eight cores) has already forced the software industry to take the daunting leap into explicit parallelism. Once you give up on serial code, it is much easier to take the leap towards higher degrees of parallelism—perhaps hundreds of threads. If you can express more parallelism efficiently, it opens up a more aggressive approach, termed “manycore,” that uses much simpler, lower-frequency cores to achieve more substantial power and performance benefits than can be achieved by the incremental path. With the manycore design point, hundreds to thousands of computational threads per chip are possible. Ultimately, all of these paths lead us into an era of exponential growth in explicit parallelism.

However, fully unleashing the potential of the manycore approach to ensure future advances in sustained computational performance will require fundamental advances in computer architecture and programming models—advances that are nothing short of reinventing computing. This in turn will result in a new parallel programming paradigm, which is already being explored by Berkeley researchers in a five-year program funded by Microsoft, Intel, and California’s UC Discovery program.

Recent trends in the microprocessor industry have important ramifications for the design of the next generation of high-performance computing (HPC) systems as we look beyond the petascale. The need to switch to an exponential growth path in system concurrency is leading to reconsideration of interconnect design, memory balance, and I/O system design that will have dramatic consequences for the design of future HPC applications and algorithms. The required reengineering of existing application codes will likely be as dramatic as the migration from vector HPC systems to massively parallel processors (MPPs) that occurred in the early 1990s. Such comprehensive code re-engineering took nearly a decade for the HPC community, so there are serious concerns about undertaking yet another major transition in our software infrastructure. However, the mainstream desktop and handheld computing industry has even less experience with parallelism than the HPC community. The transition to explicit on-chip parallelism has proceeded without any strategy in place for writing parallel software.


Recent trends in the microprocessor industry have important ramifications for the design of the next generation of high-performance computing systems as we look beyond the petascale.



Although industry is making moves in this direction and starting to get its feet wet, the HPC community has yet to take up the cause. But as one member of the Berkeley team puts it, “To move forward in HPC, we’re all going to have to get on the parallelism crazy train.”

The Berkeley team laid out their ideas in a paper entitled “A View of the Parallel Computing Landscape” published in late 2006. Known informally as “The Berkeley View,” the paper continues to draw interest and spur discussion. When co-author David Patterson of UC–Berkeley presented an overview at the SC08 conference in Austin in November 2008, hundreds of attendees turned out for his invited talk, which addressed “the multicore/manycore sea change.”

A Shift Driven by Industry

The trend toward parallelism is already underway as the industry moves to multicores. Desktop systems featuring dual-core processors, and even HPC, are getting in on the act. In the past year, the Cray XT supercomputers at Oak Ridge National Laboratory (ORNL) and NERSC were upgraded from dual-core to quad-core processors. The upgrade not only doubled the number of cores, it also doubled the parallelism of the SIMD floating-point units to get a net 3.5× increase in peak flop rate for a 2× increase in core count. While these swaps were relatively quick paths to dramatic increases in performance, the conventional multicore approach (two, four, and even eight cores) adopted by the computing industry will eventually hit a performance plateau, just as traditional sources of performance improvements such as instruction-level parallelism (ILP) and clock frequency scaling have been flattening since 2003, as shown in figure 1.
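The arithmetic behind that 3.5× figure is easy to reproduce. The sketch below is a back-of-the-envelope calculation; the clock rates and SIMD widths are illustrative assumptions rather than the actual ORNL or NERSC part specifications.

```python
# Peak flop rate per socket = cores x clock x flops per cycle per core.
# The clock rates and SIMD widths below are illustrative assumptions,
# not the actual specifications of the upgraded machines.
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    return cores * clock_ghz * flops_per_cycle

old = peak_gflops(cores=2, clock_ghz=2.6, flops_per_cycle=2)  # dual-core, 2-wide FP
new = peak_gflops(cores=4, clock_ghz=2.3, flops_per_cycle=4)  # quad-core, 4-wide FP

print(f"old: {old:.1f} Gflop/s, new: {new:.1f} Gflop/s, ratio: {new / old:.2f}x")
```

With these assumed numbers the ratio works out to roughly 3.5×: doubling both the core count and the SIMD width more than offsets a slightly lower clock.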

Figure 2 shows the improvements in processor performance as measured by the SPEC integer benchmark over the period from 1975 to present. Since 1986 performance has improved by 52% per year with remarkable consistency. During that period, as process geometries scaled downward according to Moore’s Law, the active capacitance of circuits and supply voltages also scaled down. This approach, known as constant electric field frequency scaling, fed the relentless increases in CPU clock rates over the past decade and a half. As manufacturers shrank chip features below the 90 nm scale, however, this technique began to hit its limits: the transistor threshold voltage could not be scaled ideally without losing large amounts of power to leakage, and this in turn meant the supply voltage could not be reduced correspondingly. Processors soon started hitting a power wall, as adding more power-hungry transistors to a core gave only incremental improvements in serial performance—all in a bid to avoid exposing the programmer to explicit parallelism. Some chip designs, such as the Intel Tejas, were ultimately cancelled due to power consumption issues.
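The power wall can be summarized with the standard first-order CMOS power model; this is textbook reasoning added here for context, not a formula from the article (amsmath notation assumed).

```latex
% First-order CMOS power model (textbook approximation):
% dynamic switching power plus static leakage.
P \;\approx\; \underbrace{\alpha\, C\, V^{2} f}_{\text{dynamic}}
      \;+\; \underbrace{V\, I_{\text{leak}}}_{\text{static}}
```

Under ideal constant-field scaling, the capacitance C and supply voltage V shrink along with the feature size, so the frequency f can rise while power density stays roughly flat. Once V, and the threshold voltage beneath it, can no longer be reduced without inflating the leakage current, any further increase in f shows up directly as heat, which is the wall described above.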

“With the desktop, the mantra was performance at any cost and there really was no interest in computational efficiency—brute force won the day,” says John Shalf of LBNL. “But the move to multicore shows that computational efficiency matters once again.”

This issue of power density has now become the dominant constraint in the design of new processing elements, and ultimately limits clock frequency growth for future microprocessors.


The trend toward parallelism is already underway as the industry moves to multicores.

Figure 1. This graph shows that Moore’s law is alive and well, but the traditional sources of performance improvements (ILP and clock frequencies) have all been flattening. [Plotted series: transistors (thousands), clock speed (MHz), power (W), and performance per clock (ILP) versus year, 1970–2010.]

Figure 2. The performance of processors as measured by SpecINT has grown 52% per year with remarkable consistency, but improvements tapered off around 2002. [Axes: performance relative to the VAX-11/780 versus year, 1978–2006; annotated growth-rate regimes of 25%/year, 52%/year, and an open question (??%/year) going forward.]

Illustrations: A. Tovey; source: D. Patterson, UC–Berkeley.


The direct result has been a stall in clock frequency that is reflected in the flattening of the performance growth rates starting in 2002. In 2006, individual processor cores were nearly a factor of three slower than if progress had continued at the historical rate of the preceding decade. This has led to a significant course correction in the ITRS semiconductor roadmap, as shown in figure 3.

Other approaches for extracting more performance, such as out-of-order instruction processing to increase ILP, have also delivered diminishing returns (figure 1). Having exhausted other well-understood avenues to extract more performance from a uniprocessor, the mainstream microprocessor industry has responded by halting further improvements in clock frequency and increasing the number of cores on the chip. In fact, as Patterson noted in his presentation at the SC08 conference in Austin, AMD, Intel, IBM, and Sun now sell more multicore chips than uniprocessor chips. Whether or not users are ready for the new processors, Patterson and John Hennessy—President of Stanford and co-author with Patterson of the famous computer architecture textbook—estimate that the number of cores per chip is likely to double every 18–24 months henceforth. Hennessy has commented, “If I were still in the computing industry, I’d be very worried right now.” New chips currently on the drawing boards, which will appear over the next five years, are parallel, Patterson notes. Therefore, new algorithms and programming models will need to stay ahead of a wave of geometrically increasing system concurrency—a tsunami of parallelism.
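A quick projection shows the scale of that wave; the 2009 starting point of four cores per chip is an assumption chosen purely for illustration.

```python
# Project cores per chip if the count doubles every 18-24 months.
# The starting point (4 cores) is an illustrative assumption.
def projected_cores(start_cores, years_elapsed, doubling_period_years):
    return start_cores * 2 ** (years_elapsed / doubling_period_years)

for period in (1.5, 2.0):
    cores = projected_cores(start_cores=4, years_elapsed=6, doubling_period_years=period)
    print(f"doubling every {period} years: ~{cores:.0f} cores per chip after 6 years")
```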

The stall in clock frequencies and the industry’s comparatively straightforward response of doubling cores has reinvigorated study of more radical alternative approaches to computing such as Field-Programmable Gate Arrays (FPGAs), general-purpose programming of Graphics Processing Units (GPGPUs), and even dataflow-like tiled array architectures such as the TRIPS project at the University of Texas–Austin. The principal impediment to adopting a more radical approach to hardware architecture is that we know even less about how to program such devices efficiently for diverse applications than we do parallel machines composed of multiple CPU cores. Kurt Keutzer puts this more elegantly when he states, “The shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs in novel software and architectures for parallelism; instead, this plunge into parallelism is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional uniprocessor architectures.” To get at the heart of Keutzer’s statement, it is necessary to deconstruct the most serious problems with current CPU core designs.

New IC Design Constraints

The problem of current leakage, which limits continued performance improvements based on clock frequency scaling, is not the only constraint pushing the semiconductor industry in new directions. The other problem facing the industry is the extremely high cost of new logic designs, which creates pressure to simplify and shrink core designs. In order to squeeze more performance out of processor cores at high frequencies, a considerable amount of surface area is devoted to latency-hiding technology, such as deep execution pipelines and out-of-order instruction processing. The phenomenally complex logic required to implement such features has caused design costs of new chips to skyrocket to hundreds of millions of dollars per design. It has also become impractical to verify all logic on new chip designs containing hundreds of millions of logic gates due to the combinatorial nature of the verification process. Finally, with the move to smaller feature sizes for future chips, the likelihood of chip defects will continue to increase, and any defect makes the core that contains it non-functional. Larger and more complex CPU logic designs place a higher penalty on such defects, thereby lowering chip yield.


Figure 3. Even some experts in the field were taken by surprise by the sudden end of clock speed scaling. This graph shows the clock-rate roadmap from the International Technology Roadmap for Semiconductors (ITRS), whose predictions had been remarkably accurate until recently. In 2005, the group predicted clock rates of over 12 GHz by 2009, but the 2007 roadmap dramatically curtailed those predictions and put them close to what Intel, among other companies, was actually producing by then. [Axes: clock rate (GHz) versus year, 2001–2013; plotted series: Intel single-core and multicore parts alongside the 2005 and 2007 ITRS roadmaps.]

New algorithms and programming models will need to stay ahead of a wave of geometrically increasing system concurrency—a tsunami of parallelism.

Illustration: A. Tovey; source: D. Patterson, UC–Berkeley.


Industry can and does sell chips in which a core is disabled, but such chips are more attractive if the missing core is a small fraction of the overall performance. These key problems ultimately conspire to limit the performance and practicality of extrapolating past design techniques to future chip designs, regardless of whether the logic implements some exotic non-von Neumann architecture or a more conventional approach.

In the view of the Berkeley team, here are the remedies for these problems:

• Power—parallelism is an energy-efficient way to achieve performance. Many simple cores offer higher performance per unit area for parallel codes than a comparable design employing smaller numbers of complex cores.

• Design Cost—the behavior of a smaller, simpler processing element is much easier to predict within existing electronic design-automation workflows and more amenable to formal verification. Lower complexity makes the chip more economical to design and produce.

• Defect Tolerance—smaller processing elements provide an economical way to improve defect tolerance by providing many redundant cores that can be turned off if there are defects. For example, the Cisco Metro chip contains 188 cores with four redundant processor cores per die. The STI Cell processor has eight cores, but only six are enabled in its mainstream consumer application—the Sony PlayStation 3—in order to provide additional redundancy to better tolerate defects. A simple yield model illustrating this effect is sketched below.
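The defect-tolerance argument can be made concrete with the classic Poisson yield model; the defect density, die area, and spare-core count below are illustrative assumptions, not data about any real process or product.

```python
import math

# Classic Poisson yield model: probability that a region of area A (cm^2) is
# defect-free given a defect density D (defects/cm^2). Numbers are assumed.
def yield_fraction(area_cm2, defects_per_cm2):
    return math.exp(-area_cm2 * defects_per_cm2)

D = 0.5           # assumed defect density (defects per cm^2)
die_area = 4.0    # assumed die area (cm^2)

# One monolithic core covering the die: any defect kills the chip.
monolithic_yield = yield_fraction(die_area, D)

# 100 small cores of the same total area, with 4 spares: the chip survives
# as long as at most 4 cores are defective (binomial over per-core yield).
n_cores, spares = 100, 4
core_yield = yield_fraction(die_area / n_cores, D)
chip_yield = sum(
    math.comb(n_cores, k) * (1 - core_yield) ** k * core_yield ** (n_cores - k)
    for k in range(spares + 1)
)

print(f"monolithic design yield:     {monolithic_yield:.1%}")
print(f"manycore design with spares: {chip_yield:.1%}")
```

Under these assumptions the monolithic die yields well under 20%, while the manycore die with a few spare cores yields above 90%, which is the economics behind the disabled-core products mentioned above.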

Getting Around the Constraints: Manycore versus Multicore

The industry buzzword “multicore” captures the plan of doubling the number of standard cores per die with every semiconductor process generation, starting with a single processor. Multicore will obviously help multiprogrammed workloads, which contain a mix of independent sequential tasks, and prevent further degradation of individual task performance. But how will individual tasks become faster? Switching from sequential to modestly parallel computing will make programming much more difficult without rewarding this greater effort with a dramatic improvement in power-performance. Hence, multicore is unlikely to be the ideal answer.

The alternative approach moving forward is to adopt the “manycore” trajectory, which employs simpler cores running at modestly lower clock frequencies. Rather than progressing from two to four to eight cores with the multicore approach, a manycore design would start with hundreds of cores and progress geometrically to thousands of cores over time. Figure 4 shows that moving to a simpler core design results in modestly lower clock frequencies, but has enormous benefits in power consumption and chip surface area. Even if you presume that the simpler core will offer only one-third the computational efficiency of the more complex out-of-order cores, a manycore design would still be an order of magnitude more power- and area-efficient in terms of sustained performance.
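Plugging the figure 4 numbers into that argument makes the order-of-magnitude claim concrete. The sketch below uses the article's own conservative assumption that a simple core sustains only one-third of the work per clock of a complex out-of-order core; the area, power, and clock values are the ones listed with figure 4.

```python
# Area (mm^2), power (W), and clock (MHz) are taken from figure 4; the 1/3
# relative efficiency per clock is the article's conservative assumption.
complex_core = {"area": 389.0, "power": 120.0, "clock": 1900.0, "eff": 1.0}
simple_core  = {"area": 0.8,   "power": 0.09,  "clock": 600.0,  "eff": 1.0 / 3.0}

def perf(core):
    # Sustained-performance proxy: clock rate times relative work per clock.
    return core["clock"] * core["eff"]

for name, core in (("Power5-class", complex_core), ("Tensilica DP-class", simple_core)):
    print(f"{name:18s} perf/W = {perf(core) / core['power']:8.1f}   "
          f"perf/mm^2 = {perf(core) / core['area']:8.1f}")
```

Even after charging the simple core a threefold efficiency penalty, it comes out far more than ten times better on both performance per watt and performance per unit area, so a die filled with such cores wins on sustained performance.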

Consumer Electronics Core Design Meets High-End Computing

The approach of using simpler, lower-frequency core designs has been used for many years by the embedded-computing industry to improve battery life, lower design costs, and reduce time to market for consumer electronics. In the past, the design targets for embedded applications were nearly the opposite of the performance-driven requirements of high-end computing. However, the needs of the high-end computing market have converged with the design motivation of the embedded-computing industry as a result of their common need to improve energy efficiency and reduce design costs. With regard to energy efficiency, the embedded-computing industry has the most accumulated expertise and technology.


Figure 4. The diagram shows the relative size and power dissipation of different CPU core architectures. Simpler processor cores require far less surface area and power with only a modest drop in clock frequency. Even when measured by sustained performance on applications, the power efficiency and performance per unit area are significantly better when using the simpler cores. Core data shown in the figure:

• Power5 (server): 389 mm², 120 W @ 1,900 MHz
• Intel Core2 sc (laptop): 130 mm², 15 W @ 1,000 MHz
• ARM Cortex A8 (automobiles): 5 mm², 0.8 W @ 800 MHz
• Tensilica DP (cell phones/printers): 0.8 mm², 0.09 W @ 600 MHz
• Tensilica Xtensa (Cisco router): 0.32 mm² for three cores, 0.05 W @ 600 MHz

The needs of the high-end computing market have converged with the design motivation of the embedded-computing industry as a result of their common need to improve energy efficiency and reduce design costs.

Source: D. Patterson, UC–Berkeley.


Whereas past design innovations in high-end computing, such as superscalar execution and out-of-order instruction pipelines, trickled down to consumer applications on PCs, we are starting to see innovations that emerged in the embedded space trickling up into high-end server designs. This flow of innovation is likely to increase in the future.

Designing a core that does no more than you need it to do makes sense, and the tools used to design embedded chips revolve around tailoring the core design for the specific application. This really changes the meaning of “commodity computing technology,” from repurposing a complete computer node designed for a broader desktop market to repurposing a core designed for a broader consumer electronics market—it takes the meaning of commodity and moves it onto the chip.

The Revolution is Already Underway

There are already examples of the convergence between embedded computing and HPC in the design of the Blue Gene and SiCortex supercomputers, which are based on embedded-processor cores that are more typically seen in automobiles, cell phones, and toaster ovens. Parallelism at concurrencies formerly associated with HPC applications is already emerging in mainstream embedded applications. The Metro chip in the new Cisco CRS-1 router contains 188 general-purpose Tensilica cores, and has supplanted Cisco’s previous approach of employing custom Application-Specific Integrated Circuits (ASICs) for the same purpose.

Surprisingly, the performance and power efficiency of the Metro for its application are competitive with full-custom ASICs, which themselves are more power- and area-efficient than could be achieved using FPGAs (dimming hopes that FPGAs offer a more energy-efficient approach to computation). The Motorola RAZR cell phone also contains eight Tensilica cores. The NVidia G80 (CUDA) GPU replaces the semi-custom pipelines of previous-generation GPUs with 128 more general-purpose CPU cores. The G80 in particular heralds the convergence of manycore with mainstream computing applications. Whereas traditional GPGPUs have a remarkably obtuse programming model involving drawing an image of your data to the framebuffer (the screen), the G80’s more general-purpose cores can be programmed using more conventional C code and will soon support IEEE-standard double-precision arithmetic.

The motivation for using more general-purpose cores is the increasing role of GPUs for accelerating commonly required non-graphical game calculations such as artificial intelligence (AI) for characters in games, object interactions, and even models for physical processes. Products such as AISeek’s Intia and Ageia’s PhysX (Ageia was recently acquired by NVidia) have implemented game physics acceleration using algorithms that are very similar to those used for scientific simulation and modeling (see Further Reading). ATI (recently acquired by AMD) has proposed offering GPUs that share the same cache-coherent HyperTransport fabric as its mainstream CPUs. Intel’s experimental Polaris chip uses 80 simpler CPU cores on a single chip to deliver one teraflop/s of peak performance while consuming only 65 watts, and this design experience may feed into future GPU designs from Intel. Both Intel and AMD roadmaps indicate that tighter integration between GPUs and CPUs is the likely path toward introducing manycore processing to mainstream consumer applications on desktop and laptop computers.

Step Aside PCs—Consumer Electronics Are Now Driving CPU Innovation

Taking a historical perspective on HPC system design, Bell’s Law is likely as important to the development of HPC system design as Moore’s Law. Bell’s Law is the corollary of Moore’s Law and holds that by maintaining the same design complexity you can halve the size and costs of the chip every 18 months. The TOP500 project has documented a steady progression from the early years of supercomputing, from exotic and specialized designs towards clusters composed of components derived from consumer PC applications. The enormous volumes and profits of desktop PC technology led to huge cost/performance benefits for employing clusters of desktop CPU designs for scientific applications, despite the lower computational efficiency.


Figure 5. The graph shows the declining influence of PC microprocessors on the overall revenue share in microprocessor-based electronic designs. [Axes: market in Japan ($ billions) versus year, 2001–2005; plotted categories include digital, analog, DSC, DVD, PC, and TV.]

Taking a historical perspective on HPC system design, Bell’s Law is likely as important to the development of HPC system design as Moore’s Law.

Illustration: A. Tovey; source: D. Patterson, UC–Berkeley.


As we move into the new century, the center of gravity for the market in terms of unit volume and profit has shifted to handheld consumer electronics. This movement has some significant implications for the future of HPC.

As can be seen in figure 5, the market share of consumer electronics applications for CPUs surpassed that of the desktop PC industry in 2003, and the disparity continues to grow. Shortly after 2003, revenue in the PC business flattened (figure 6), and IBM subsequently sold off its desktop and portable personal computer units. During that same period, Apple moved into the portable electronic music player market with the iPod, then into the cellular phone business, and eventually dropped “Computer” from its corporate name. This may be the most dramatic twist yet for the HPC industry if these trends continue (and they likely will). Although the desktop computing industry has been leading a major upheaval in HPC system design over the past decade (the movement to clusters based on desktop technology), that industry segment is no longer in the driver’s seat for the next revolution in CPU design. Rather, the market volume, and hence design targets, are now driven by the needs of handheld consumer electronics such as digital cameras, cell phones, and other devices based on embedded processors.

The next generation of desktop systems, and consequent HPC system designs, will borrow many design elements and components from consumer electronics devices. Namely, the base unit of integration for constructing new chip designs targeted at different applications will be the CPU core derived from embedded applications rather than the transistor. Simpler CPU cores may be combined together with some specialization to HPC applications (such as different memory controllers and double-precision floating point), just as transistors are currently arranged to form current CPU designs. Indeed, within the next three years, there will be manycore chip designs that contain more than 2,000 CPU cores, which is very close to the number of transistors used in the very first Intel 4004 CPU. This led Chris Rowen, CEO of Tensilica, to describe the new design trend by saying “the processor is the new transistor.” This is a brave new world, and we do not know where it will take us.

Ramifications for the HPC Ecosystem

Given current trends, petascale systems delivered in 2011 are projected to have the following characteristics:

• Systems will contain between 400,000 and 1,500,000 processors (50,000 to 200,000 sockets). Each socket in the system will be a chip that contains multiple cores.

• In 2011 these multicore chips will contain between 8 and 32 conventional processor cores per chip. Technology based on manycore will employ hundreds to thousands of CPU cores per chip. Consider that the Cisco CRS-1 router currently employs a chip containing 188 processor cores using conventional silicon process technology, so “1,000 cores on a chip” is not as far off as one might expect. A System on Chip (SoC) design, such as Blue Gene or SiCortex, may still use the simpler embedded cores (the manycore design point), but sacrifice raw core count to integrate more system services onto the chip (such as interconnect fabric and memory controllers).

• As microprocessor manufacturers move their design targets from peak clock rate to reducing power consumption and packing more cores per chip, there will be a commensurate trend towards simplifying the core design. This is already evidenced by the architecture of the Intel Core Duo processors, which use pipelines that are considerably shorter than those of their predecessor, the Pentium 4. The trend towards simpler processor cores will likely simplify performance tuning, but will also result in lower sustained performance per core as out-of-order instruction processing is dropped in favor of smaller, less complex, and less power-hungry in-order core designs. Ultimately this trend portends a convergence between the manycore and multicore design points.


Figure 6. This graph shows how revenues for IBM’s PC business have flattened in response to the increasing dominance of consumer electronics applications in the electronics industry. [Axes: revenue ($) and unit shipments versus year, 1980–2010; source: IDC. Annotations: IBM started its PC business in 1981 and sold the business to Lenovo in 2005. Brief history of the PC: 1975 Altair/MITS, 1978 Apple II, 1981 IBM PC (MS-DOS + i8088), 1985 Windows 1.0.]

Illustration: A. Tovey; source: D. Patterson, UC–Berkeley.



• As the number of cores per socket increases, memory will become proportionally more expensive and power-hungry in comparison to the processors. Consequently, cost and power-efficiency considerations will push memory balance (in terms of the quantity of memory put on each node) from the current nominal level of 0.5 bytes of DRAM memory per peak flop, down below 0.1 bytes/flop (possibly even less than 0.02 bytes/flop).

• Cost scaling issues will force fully-connected interconnect topologies, such as the fat-tree and crossbar, to be gradually supplanted at the high end by lower-degree interconnects such as n-dimensional tori, meshes, or alternative approaches such as hierarchical fully-connected graphs.

The consequence will be that HPC software designers must take interconnect topology, and the associated increased non-uniformity in bandwidth and latency, into account for both algorithm design and job mapping. Currently, the interconnect topology is typically ignored by mainstream code and algorithm implementations. Blue Gene/L programmers already have to grapple with topology mapping issues—it is merely a harbinger of the broader challenges facing all HPC programmers in the near future. Programming models that continue to present the illusion of a flat system communication cost will offer increasingly poor computational efficiency.
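As a toy illustration of why job mapping matters on a low-degree network, the sketch below compares average hop counts for a nearest-neighbor (halo) exchange on a small 3D torus under a topology-aware placement and a deliberately naive one; the torus size, process grid, and strided placement are assumptions for illustration, not a description of any particular machine or scheduler.

```python
import itertools

DIMS = (8, 8, 8)  # assumed torus dimensions
nx, ny, nz = DIMS

def torus_hops(a, b):
    # Minimum hops between node coordinates a and b with wrap-around links.
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, DIMS))

coords = list(itertools.product(range(nx), range(ny), range(nz)))

def rank(i, j, k):
    # Logical rank of grid point (i, j, k) in a matching 8x8x8 process grid.
    return (i % nx) * ny * nz + (j % ny) * nz + (k % nz)

# Topology-aware placement: rank r runs on node coords[r], so logical neighbors
# are physical neighbors. Naive placement: ranks scattered by a fixed stride.
aware = coords
naive = [coords[(r * 37) % len(coords)] for r in range(len(coords))]

def avg_neighbor_hops(placement):
    total = count = 0
    for i, j, k in itertools.product(range(nx), range(ny), range(nz)):
        me = placement[rank(i, j, k)]
        for di, dj, dk in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
            total += torus_hops(me, placement[rank(i + di, j + dj, k + dk)])
            count += 1
    return total / count

print(f"topology-aware placement: {avg_neighbor_hops(aware):.2f} hops per neighbor")
print(f"naive strided placement:  {avg_neighbor_hops(naive):.2f} hops per neighbor")
```

The aware placement keeps every exchange at one hop, while the naive placement averages several hops per message, which translates directly into extra link contention and latency.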

Whether or not the HPC community is convinced that the future is in multicore or manycore technology, the industry has already retooled to move in the direction of geometrically scaling parallelism. If the levels of concurrency that result from a transition directly to manycore appear daunting, the trends on the TOP500 list in figure 7 show that within just three years multicore will take us to the same breathtaking levels of parallelism. With either path, the HPC community faces an unprecedented challenge to existing practices for system design, OS design, and our programming model. The issue of programmability looms large as computer scientists wonder how on Earth they will stay abreast of exponentially increasing parallelism. If no one can program a manycore computer system productively, efficiently, and correctly, then there is little benefit to this approach for maintaining historical per-socket performance improvements for future HPC systems.

Time to Get with the Programming

Keeping abreast of geometrically increasing concurrency is certainly the most daunting challenge as we move beyond the petascale in the next few years. Most HPC algorithms and software implementations are built on the premise that concurrency is coarse-grained and its degree will continue to increase modestly, just as it has over the past 15 years. The exponentially increasing concurrency throws all of those assumptions into question. At a minimum, we will need to embark upon a campaign of re-engineering our software infrastructure that may be as dramatic and broad in scope as the transition from vector systems to massively parallel processors (MPPs) that occurred in the early 1990s.

However, the heir apparent to current programming practice is not obvious, in large part because the architectural targets are not clear.


Figure 7. The graph shows the dramatic increase in system concurrency for the Top 15 systems in the annual TOP500 list of HPC systems. Even if BG/L systems are removed from consideration, the inflection point of system concurrency is still quite dramatic. [Axes: total number of processors in the Top 15 systems versus TOP500 list, June 1993–June 2006.]

Keeping abreast of geometrically increasing concurrency is certainly the most daunting challenge as we move beyond the petascale in the next few years.

Illustration: A. Tovey; source: D. Patterson, UC–Berkeley.


There are two major questions, one involving off-chip memory bandwidth and access mechanisms, and the second being the organization of the on-chip cores. Some of the initial forays into multicore have kept both of these issues at bay. For example, the upgrades from dual- to quad-core processors at NERSC and ORNL (figure 8) upgraded memory density and bandwidth at the same time, and the cores were familiar x86 architectures. So using each core as an MPI process was a reasonable strategy both before and after the upgrades. As with the introduction of shared-memory nodes into clusters in the late 1990s, the message-passing model continued to work well both within and between nodes, which has allowed programmers to avoid rewriting their codes to take advantage of the changes. The reason? The architectures were still relatively hefty cores with their own caches, floating-point units, and superscalar support, so programs were best designed with coarse-grained parallelism and good locality to avoid cache coherence traffic. But the incremental changes to chip architecture seen to date may give HPC programmers a false sense of security.

Limited off-chip bandwidth—sometimes referred to as the “memory wall”—has been a problem for memory-intensive computations even on single-core chips that predate the move to multicore. The shift to multiple cores is significant only because it continues the increase in total on-chip computational capability after the end of clock frequency scaling, allowing us to continue hurtling towards the memory wall. The memory bandwidth problem is as much economic as technical, so the HPC community will continue to be dependent on how the commodity market will move. Current designs are unlikely to sustain the per-core bandwidth of previous generations, especially for irregular memory access patterns that arise in many important applications. Even the internal architecture of memory chips is not organized for efficient irregular word-oriented accesses.
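A back-of-the-envelope estimate in the spirit of the roofline model (also developed at Berkeley) shows how quickly additional cores outrun a fixed memory pipe; the peak rates, bandwidth, and arithmetic intensity below are illustrative assumptions.

```python
# Roofline-style bound: attainable flop rate is capped either by the chip's
# peak or by arithmetic intensity times memory bandwidth. Numbers are assumed.
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    return min(peak_gflops, flops_per_byte * bandwidth_gbs)

bandwidth = 25.0            # assumed off-chip bandwidth, GB/s
intensity = 0.1             # flops per byte, typical of stream/stencil kernels

for cores in (4, 32, 256):
    peak = cores * 10.0     # assume 10 Gflop/s per core
    got = attainable_gflops(peak, bandwidth, intensity)
    print(f"{cores:4d} cores: peak {peak:7.0f} Gflop/s, "
          f"bandwidth-bound kernel gets {got:5.1f} Gflop/s ({got / peak:.1%} of peak)")
```

Under these assumptions a memory-bound kernel already runs at a small fraction of peak on a handful of cores, and adding cores without adding bandwidth only widens the gap.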

There are promising avenues to sustain the growth in bandwidth, such as chip stacking and photonic interfaces, but this will require economic pressure from the mass market to make it happen. That pressure will depend on what applications drive consumers to buy manycore chips and whether those applications are written in a manner that requires high bandwidth in a form that is usable by the HPC community. A new parallel computing laboratory at Berkeley is working on parallel applications for handheld devices and laptops, and these have revealed some of the same memory bandwidth needs as scientific applications, which means the HPC community will not be on its own in pushing for more bandwidth. For both HPC and consumer applications, the first problem is to make the bandwidth available to the applications through a set of optimizations that mask the other memory system issue—latency. The Berkeley group applies automatic performance tuning techniques to maximize effective use of the bandwidth through latency-masking techniques such as pre-fetching, and is planning on technology innovations to supply the necessary bandwidth. But to hedge their bets on the bandwidth problem, they are also developing a new class of communication-avoiding algorithms that reduce memory bandwidth requirements by rethinking the algorithms from the numerics down to the memory accesses (sidebar “Dwarfs”).
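The payoff from rethinking memory accesses can be illustrated with the classic traffic estimate for blocked matrix multiplication; this is a generic sketch of the blocking idea that underlies such work, not the Berkeley group's actual communication-avoiding algorithms, and the matrix and cache sizes are assumptions.

```python
# Rough DRAM-traffic model for an n x n matrix multiply in 8-byte doubles.
# The naive triple loop re-reads B for every row of A; the blocked version
# keeps three b x b tiles resident in cache. Sizes are illustrative assumptions.
def traffic_words(n, cache_words, blocked):
    if not blocked:
        return n**3 + 2 * n**2
    b = int((cache_words / 3) ** 0.5)   # tile size so three tiles fit in cache
    return 2 * n**3 // b + 2 * n**2

n, cache = 4096, 64 * 1024              # assumed problem size and cache (words)
for blocked in (False, True):
    gb = traffic_words(n, cache, blocked) * 8 / 1e9
    print(f"{'blocked' if blocked else 'naive  '}: ~{gb:,.1f} GB of DRAM traffic")
```

The same arithmetic is performed in both cases; only the order of the memory accesses changes, yet the off-chip traffic drops by nearly two orders of magnitude in this example.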


Figure 8. In the past year the Cray XT supercomputers at ORNL (Jaguar, left) and NERSC (Franklin, right) have both been upgraded from dual- to quad-core processors.

The memory bandwidth problem is as much economic as technical, so the HPC community will continue to be dependent on how the commodity market will move.

ORNL (left); NERSC (right).


The architectural organization question is even more critical to the future of MPI. The notion of a core as a traditional processor may not be the right building block. Instead, the exponentially growing concurrency may come from wider SIMD or vector units, multithreaded architectures that share much of the hardware state between threads, very simple processor cores, or processors with software-managed memory hierarchies. MPI expresses coarse-grained parallelism but is not helpful in code vectorization (or SIMDization). Even if there are many full-fledged cores on a chip, the memory overhead of MPI may make it an impractical model for manycore. Consider a simple nearest-neighbor computation on a 3D mesh—if each of the manycores holds a 10×10×10 subgrid, roughly half of the points are on the surface and will be replicated on neighboring cores. Similarly, any globally shared state will have to be replicated across all cores, because MPI does not permit direct access to shared copies. This has reopened consideration of the hybrid OpenMP+MPI programming model, although that model has shown more failures than successes over the past decade. The problems with the hybrid model have included the lack of locality control in OpenMP, and the interaction between the multiple threads within OpenMP communicating through a shared network resource to other nodes. Even if these deficiencies can be overcome, OpenMP’s use of serial execution as the default between loops will be increasingly stymied by Amdahl’s law at higher on-chip concurrencies.
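The surface-to-volume arithmetic behind that 10×10×10 example is easy to check; the one-deep ghost (halo) layer assumed in the sketch below is there only to show the replication overhead.

```python
# Count interior vs. surface points of an n x n x n subgrid, plus the ghost
# (halo) points a core must replicate assuming a one-deep halo on every face.
def subgrid_counts(n):
    total = n ** 3
    interior = max(n - 2, 0) ** 3
    surface = total - interior
    ghosts = (n + 2) ** 3 - total
    return total, surface, ghosts

for n in (10, 100):
    total, surface, ghosts = subgrid_counts(n)
    print(f"n={n:4d}: {surface / total:.0%} of points on the surface, "
          f"halo adds {ghosts / total:.0%} extra storage")
```

At n = 10 roughly half of the points sit on the surface and the halo alone adds about 70% extra storage per core; at n = 100 both overheads fall to a few percent, which is why coarse-grained decompositions tolerate MPI's replication so much better than fine-grained ones.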


Dwarfs

Benchmarks are useful tools for measuring progress on computing performance over generations of machines. However, benchmarks can also inhibit progress because they are reflections of current programming models and hardware technology. Given the current trends in computer architecture, it is clear that parallel programming cannot continue along the incremental path of improving existing benchmarks and code bases. Enabling a fundamental rethinking of both hardware and programming models requires a move away from concrete implementations of applications (benchmarks) and towards higher-level patterns of computation and communication to drive our decisions about future programming models and parallel hardware architectures. Patterns are conceptual tools that help a programmer to reason about a software project and develop an architecture, but they are not themselves implementation mechanisms for producing code.

Tim Mattson observes in his book on parallel programming that patterns are not invented, but mined from successful software applications. We began our mining expedition with Phillip Colella’s observations about algorithms in parallel computing, nicknamed the Seven Dwarfs, that gave us the first insights into the structural patterns observed in HPC codes. Then, over a period of two years, we surveyed other application areas, including embedded systems, general-purpose computing (SPEC benchmarks), databases, games, artificial intelligence/machine learning, and computer-aided design of integrated circuits to search for more patterns of computation. Through this process the seven dwarfs grew to thirteen. More dwarfs may need to be added to the list. Nevertheless, we were surprised that we only needed to add six dwarfs to cover such a broad range of important applications.

Meet the Next Six Dwarfs

Figure 9 shows the original seven dwarfs, plus some that were added as a result of the team’s investigations. Although 12 of the 13 dwarfs possess some form of parallelism, finite state machines (FSMs) look to be a challenge, which is why we made them the last dwarf. Perhaps FSMs will prove to be embarrassingly sequential. If it is still important and does not yield to innovation in parallelism, that will be disappointing, but perhaps the right long-term solution is to change the algorithmic approach. In the era of multicore and manycore, popular algorithms from the sequential computing era may fade in popularity.

In any case, the point of the 13 dwarfs is not to identify the low-hanging fruit that are highly parallel. The point is to identify the kernels that are the core computation and communication for important applications in the upcoming decade, independent of the amount of parallelism. To develop programming systems and architectures that will run future applications as efficiently as possible, we must learn the limitations as well as the opportunities. We note, however, that inefficiency on embarrassingly parallel code could be just as plausible a reason for the failure of a future architecture as weakness on embarrassingly sequential code.

From Dwarfs to Patterns

Over time, we have worked to take patterns of parallel programming embodied in the dwarfs to create a formal pattern language. A book by Timothy Mattson et al., Patterns for Parallel Programming, was the first such attempt to formalize elements of parallel programming into such a pattern language. Thus the dwarfs form a critical link between the structural patterns described in Mattson’s book and the idioms for parallelization developed through years of experimentation by the software development community.

The patterns define the structure of a program, but they do not indicate what is actually computed, whereas the computational patterns embody the generalized idiom for parallelism used to implement these structural patterns. Using the analogy from civil engineering, structural patterns describe a factory’s physical structure and general workflow. Computational patterns describe the factory’s machinery, flow of resources, and work-products. Structural and computational patterns can be combined in the “pattern language” to provide a template for architecting arbitrarily complex parallel software systems.
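As a concrete instance of one computational pattern, the sketch below is a minimal structured-grid (stencil) kernel; it is meant only to show the kind of kernel a dwarf names, not code taken from the pattern-language work itself.

```python
# Minimal "structured grid" dwarf: one Jacobi-style 5-point stencil sweep.
def jacobi_sweep(grid):
    n = len(grid)
    new = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                grid[i][j - 1] + grid[i][j + 1])
    return new

# Every interior point depends only on its immediate neighbors, so the loop
# nest is naturally data-parallel -- the property this dwarf is meant to capture.
grid = [[float(i == 0) for _ in range(6)] for i in range(6)]  # hot top edge
print(jacobi_sweep(grid)[1])
```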


If the HPC community cannot succeed in selecting a scalable programming model to carry us for the next 15 years, we will be forced to reckon with multiple phases of ground-up software rewrites, or a hard limit to the useable performance of future systems.



If we simply treat a multicore chip as a traditional shared-memory multiprocessor—or worse still, treat it as a flat MPI machine—then we may miss opportunities for new architectures and algorithm designs that can exploit these new features.

Settling on a stable and performance-portable programming model that accommodates the new course of evolution for chip multiprocessors is of utmost importance as we move forward. If the HPC community cannot succeed in selecting a scalable programming model to carry us for the next 15 years, we will be forced to reckon with multiple phases of ground-up software rewrites, or a hard limit to the useable performance of future systems. Consequently, the programming model for manycore systems is the leading challenge for future systems.

Manycore Opportunities

Most of the consternation regarding the move to multicore reflects an uncertainty that we will be able to extract sufficient parallelism from our existing algorithms. This concern is most pronounced in the desktop software development market, where parallelism within a single application is almost nonexistent, and the performance gains that have enabled each advance in features will no longer be forthcoming from the single processor. The HPC community has concerns of its own, remembering well the pain associated with rewriting their codes to move from vector supercomputers to commodity-based cluster architectures. The vector machines were designed specifically for HPC, whereas commodity microprocessors and associated software tools lacked many of the features they had come to rely upon. But there is a real opportunity for HPC experts to step forward and help drive the innovations in parallel algorithms, architectures, and software needed to face the multicore revolution. While the application drivers of the multicore designs will be different from the scientific applications in HPC, the concerns over programming effectiveness, performance, scalability, and efficiency, and the techniques needed to achieve them, will largely be shared. Rather than sitting back to await the inevitable revolution in hardware and software, the HPC community has an opportunity to look outside its normal scientific computing space and ensure that the years of hard-won lessons in effective use of parallelism will be shared with the broader community.

Contributors: John Shalf and Jon Bashor (technical writer), LBNL; Dave Patterson, Krste Asanovic, and Katherine Yelick, LBNL and UC–Berkeley; Kurt Keutzer, UC–Berkeley; and Tim Mattson, Intel

Further Reading

AISeek: http://www.aiseek.com/

PhysX: http://www.nvidia.com/object/physx_new.html


Figure 9. An illustrated summary of the growing list of dwarfs—algorithmic methods that each capture a pattern of computation and communication—across areas of application. [Dwarfs shown: Finite State Machines, Circuits, Graph Algorithms, Structured Grid, Dense Matrix, Sparse Matrix, Spectral (FFT), Dynamic Programming, Particle Methods, Backtrack/Branch and Bound, Graphical Models, and Unstructured Grid. Application areas: embedded, SPEC, databases, games, machine learning, CAD, HPC, health, image, speech, music, and browser.]

There is a real opportunity for HPC experts to step forward and help drive the innovations in parallel algorithms, architectures, and software needed to face the multicore revolution.

Illustration: A. Tovey.
