
  • Auto-tuning Performance on Multicore Computers

    Samuel Webb Williams

    Electrical Engineering and Computer Sciences
    University of California at Berkeley

    Technical Report No. UCB/EECS-2008-164

    http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-164.html

    December 17, 2008

  • Copyright 2008, by the author(s). All rights reserved.

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

  • Auto-tuning Performance on Multicore Computers

    by

    Samuel Webb Williams

    B.S. (Southern Methodist University) 1999
    B.S. (Southern Methodist University) 1999

    M.S. (University of California, Berkeley) 2003

    A dissertation submitted in partial satisfaction of the

    requirements for the degree of

    Doctor of Philosophy

    in

    Computer Science

    in the

    GRADUATE DIVISION

    of the

    UNIVERSITY OF CALIFORNIA, BERKELEY

    Committee in charge:
    Professor David A. Patterson, Chair

    Professor Katherine Yelick
    Professor Sara McMains

    Fall 2008

  • Auto-tuning Performance on Multicore Computers

    Copyright 2008
    by

    Samuel Webb Williams


    Abstract

    Auto-tuning Performance on Multicore Computers

    by Samuel Webb Williams

    Doctor of Philosophy in Computer Science
    University of California, Berkeley

    Professor David A. Patterson, Chair

    For the last decade, the exponential potential of Moore's Law has been squandered in the effort to increase single thread performance, which is now limited by the memory, instruction, and power walls. In response, the computing industry has boldly placed its hopes on the multicore gambit. That is, abandon instruction-level parallelism and frequency scaling in favor of the exponential scaling of the number of compute cores per microprocessor. The massive thread-level parallelism results in tremendous potential performance, but demands efficient parallel programming, a task existing software tools are ill-equipped for. We desire performance portability: the ability to write a program once and not only have it deliver good performance on the development computer, but on all multicore computers today and tomorrow.

    This thesis accepts as fact that multicore is the basis for all future computers. Furthermore, we regiment our study by organizing it around the computational patterns and motifs as set forth in the Berkeley View. Although domain experts may be extremely knowledgeable on the mathematics and algorithms of their fields, they often lack the detailed computer architecture knowledge required to achieve high performance. Forthcoming heterogeneous architectures will exacerbate the problem for everyone. Thus, we extend the auto-tuning approach to program optimization and performance portability to the menagerie of multicore computers. In an automated fashion, an auto-tuner will explore the optimization space for a particular computational kernel of a motif on a particular computer. In doing so, it will determine the best combination of algorithm, implementation, and data structure for the combination of architecture and input data.
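    The search described above can be pictured as a small benchmarking harness: enumerate candidate implementations of a kernel, time each one on the target machine with representative input, and keep the fastest. The C sketch below only illustrates that idea under simplifying assumptions; the two vector-scaling "variants" and the timing scheme are hypothetical stand-ins, not the LBMHD or SpMV auto-tuners developed in this thesis.

    /* Minimal auto-tuning sketch: benchmark each candidate implementation
     * of a kernel and select the fastest for this machine and input.
     * The variants here (plain vs. 4-way unrolled vector scale) are
     * hypothetical stand-ins for auto-generated code variants. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    typedef void (*kernel_fn)(int n, double a, const double *x, double *y);

    static void scale_plain(int n, double a, const double *x, double *y) {
        for (int i = 0; i < n; i++) y[i] = a * x[i];
    }

    static void scale_unroll4(int n, double a, const double *x, double *y) {
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            y[i]   = a * x[i];
            y[i+1] = a * x[i+1];
            y[i+2] = a * x[i+2];
            y[i+3] = a * x[i+3];
        }
        for (; i < n; i++) y[i] = a * x[i];
    }

    /* Time many trials of one variant with a coarse wall-clock timer. */
    static double time_kernel(kernel_fn fn, int n, const double *x, double *y) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int trial = 0; trial < 100; trial++) fn(n, 2.0, x, y);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    }

    int main(void) {
        const int n = 1 << 20;
        double *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
        for (int i = 0; i < n; i++) x[i] = (double)i;

        struct { const char *name; kernel_fn fn; } variants[] = {
            { "plain",   scale_plain   },
            { "unroll4", scale_unroll4 },
        };

        int best = 0; double best_t = 1e30;
        for (int v = 0; v < 2; v++) {
            double t = time_kernel(variants[v].fn, n, x, y);
            printf("%-8s %.4f s\n", variants[v].name, t);
            if (t < best_t) { best_t = t; best = v; }
        }
        printf("selected: %s\n", variants[best].name);
        free(x); free(y);
        return 0;
    }

    A real auto-tuner enumerates a far larger space (blockings, data structures, prefetch distances, thread counts), often via code generators and heuristic pruning rather than the exhaustive loop shown here.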

    We implement and evaluate auto-tuners for two important kernels: Lattice Boltzmann Magnetohydrodynamics (LBMHD) and sparse matrix-vector multiplication (SpMV). They are representative of two of the computational motifs: structured grids and sparse linear algebra. To demonstrate the performance portability that our auto-tuners deliver, we selected an extremely wide range of architectures as an experimental test bed. These include conventional dual- and quad-core superscalar x86 processors both with and without integrated memory controllers. We also include the rather unconventional chip multithreaded (CMT) Sun Niagara2 (Victoria Falls) and the heterogeneous, local store-based IBM Cell Broadband Engine. In some experiments we sacrifice the performance portability of a common C representation by creating ISA-specific auto-tuned versions of these kernels to gain architectural insight. To quantify our success, we created the Roofline model to perform a bound and bottleneck analysis for each kernel-architecture combination.
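    As a point of reference for the SpMV kernel, a straightforward, untuned implementation over the compressed sparse row (CSR) format looks as follows. This is a generic sketch, not code from the thesis; auto-tuned versions restructure this loop nest (for example with register and cache blocking) while computing the same y = Ax.

    /* Untuned reference SpMV (y = A*x) over compressed sparse row storage:
     * row_ptr[i]..row_ptr[i+1] delimit the stored nonzeros of row i. */
    typedef struct {
        int     nrows;    /* number of matrix rows                */
        int    *row_ptr;  /* nrows+1 offsets into col_idx/values  */
        int    *col_idx;  /* column index of each stored nonzero  */
        double *values;   /* value of each stored nonzero         */
    } csr_matrix;

    void spmv_csr(const csr_matrix *A, const double *x, double *y) {
        for (int i = 0; i < A->nrows; i++) {
            double sum = 0.0;
            for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
                sum += A->values[k] * x[A->col_idx[k]];
            y[i] = sum;
        }
    }

    Each stored nonzero contributes two flops but at least twelve bytes of matrix traffic (an 8-byte value plus a 4-byte column index), which is why SpMV is usually presumed to be memory bandwidth-bound.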

    Despite the common wisdom that LBMHD and SpMV are memory bandwidth-bound, and thus nothing can be done to improve performance, we show that auto-tuning consistently delivers speedups in excess of 3× across all multicore computers except the memory-bound Intel Clovertown, where the benefit was as little as 1.5×. The Cell processor, with its explicitly managed memory hierarchy, showed far more dramatic speedups of between 20× and 130×. The auto-tuners include both architecture-independent optimizations based solely on source code transformations and high-level kernel knowledge, as well as architecture-specific optimizations like the explicit use of single instruction, multiple data (SIMD) extensions or the use of Cell's DMA-based memory operations. We observe that these ISA-specific optimizations are becoming increasingly important as architectures evolve.
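    The Roofline model used above to quantify these results reduces bound-and-bottleneck analysis to a simple formula: attainable performance is the minimum of the machine's peak compute rate and the product of the kernel's arithmetic intensity and the machine's peak memory bandwidth. The sketch below merely evaluates that formula; the peak rates and the intensities are illustrative placeholders, not measurements from the test bed.

    /* Roofline bound: attainable GFLOP/s is limited either by in-core peak
     * performance or by DRAM bandwidth times arithmetic intensity. */
    #include <stdio.h>

    static double roofline_gflops(double peak_gflops,    /* in-core peak (GFLOP/s) */
                                  double peak_gbs,       /* DRAM bandwidth (GB/s)  */
                                  double flops_per_byte  /* arithmetic intensity   */) {
        double bandwidth_bound = flops_per_byte * peak_gbs;
        return bandwidth_bound < peak_gflops ? bandwidth_bound : peak_gflops;
    }

    int main(void) {
        /* Placeholder machine: 75 GFLOP/s peak, 20 GB/s sustained DRAM bandwidth. */
        const double peak_gflops = 75.0, peak_gbs = 20.0;

        /* Low-intensity kernels (roughly SpMV- and LBMHD-like) land on the
         * bandwidth-limited slope; a high-intensity kernel hits the compute roof. */
        const double intensities[] = { 0.166, 0.7, 8.0 };
        for (int k = 0; k < 3; k++)
            printf("AI = %5.3f flop/byte -> bound = %5.1f GFLOP/s\n",
                   intensities[k],
                   roofline_gflops(peak_gflops, peak_gbs, intensities[k]));
        return 0;
    }

    Kernels whose bound sits on the bandwidth slope can still gain from tuning that raises delivered bandwidth or arithmetic intensity, which is the source of the speedups reported above.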

    Professor David A. Patterson
    Dissertation Committee Chair


    To those who always believed in me, even when I didn't.


    Contents

    List of Figures vii

    List of Tables xi

    List of Symbols xiii

    1 Introduction 1

    2 Motivation and Background 5
      2.1 Why Optimize for Performance? 5
      2.2 Trends in Computing 6
        2.2.1 Moore's Law 6
        2.2.2 Frequency and Power 7
        2.2.3 Single Thread Performance 7
        2.2.4 The Multicore Gambit 7
        2.2.5 DRAM Bandwidth 9
        2.2.6 DRAM Latency 9
        2.2.7 Cache Coherency 9
        2.2.8 Productivity, Programmers, and Performance 10
      2.3 Dwarfs, Patterns, and Motifs 10
        2.3.1 The Berkeley View 11
        2.3.2 The Case for Patterns 11
        2.3.3 The Case for Motifs 12
      2.4 The Case for Auto-tuning 13
        2.4.1 An Introduction to Auto-tuning 13
        2.4.2 Auto-tuning the Dense Linear Algebra Motif 17
        2.4.3 Auto-tuning the Spectral Motif 19
        2.4.4 Auto-tuning the Particle Method Motif 20
      2.5 Summary 21

    3 Experimental Setup 23
      3.1 Architecture Overview 23
        3.1.1 Computers Used 23
        3.1.2 Memory Hierarchy 26
        3.1.3 Interconnection Topology 29
        3.1.4 Coping with Memory Latency 31
        3.1.5 Coherency 34
      3.2 Programming Models, Languages and Tools 36
        3.2.1 Programming Model 36
        3.2.2 Strong Scaling 36
        3.2.3 Barriers 37
        3.2.4 Affinity 39
        3.2.5 Compilers 39
        3.2.6 Performance Measurement Methodology 40
        3.2.7 Program Structure 40
      3.3 Summary 45

    4 Roofline Performance Model 46
      4.1 Related Work 47
      4.2 Performance Metrics and Related Terms 48
        4.2.1 Work vs. Performance 48
        4.2.2 Arithmetic Intensity 49
      4.3 Naïve Roofline 50
      4.4 Expanding upon Communication 52
        4.4.1 Cache Coherency 52
        4.4.2 DRAM Bandwidth 53
        4.4.3 DRAM Latency 53
        4.4.4 Cache Line Spatial Locality 54
        4.4.5 Putting It Together: Bandwidth Ceilings 54
      4.5 Expanding upon Computation 57
        4.5.1 In-Core Parallelism 57
        4.5.2 Instruction Mix 59
        4.5.3 Putting It Together: In-Core Ceilings