A study of performance and power trade-offs on heterogeneous architectures using the Qilin programming model


Submitted by: Gurbinder Singh Gill, CSE, B.Tech. 4th Year, IIT Roorkee, India

GUIDES: DR. RICHARD VUDUC

DR. HYESOON KIM

Georgia Institute of Technology, Georgia, USA

CRUISE INTERNSHIP 2011

The Problem:

How do changes in power affect performance?

• Looking for a relationship between power and performance

One of the “Dark Silicon” papers proposed that:

Future processors are likely to be power-limited, meaning there will not be enough total power to turn on the entire chip.

Why?

• Moore's Law: the number of transistors on a chip keeps growing with each process generation.

This growth in turn increases the power required by the chip.

• Power increases with both the number of transistors and the supply voltage.
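To make the dependence on transistor count and supply voltage concrete, here is a minimal sketch of the usual first-order dynamic power model for CMOS logic, P = N · α · C · V² · f. The slides do not give a formula, so the model and every parameter name below are assumptions added for illustration only.

```cpp
// First-order dynamic power model (an assumption for illustration; the
// slides do not state a formula): P = N * alpha * C * V^2 * f.
//   num_transistors : number of switching devices (N)
//   activity        : fraction of devices switching per cycle (alpha)
//   cap_farads      : switched capacitance per device (C)
//   vdd_volts       : supply voltage (V)
//   freq_hz         : clock frequency (f)
double dynamic_power_watts(double num_transistors, double activity,
                           double cap_farads, double vdd_volts,
                           double freq_hz) {
    return num_transistors * activity * cap_farads *
           vdd_volts * vdd_volts * freq_hz;
}
```

Under this model, doubling the transistor count doubles the power, while doubling the supply voltage quadruples it.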

How can heterogeneous chips help?

Why?

Today, one widespread example of a mix of processing components is the use of both CPUs and GPUs in a system. Therefore, understanding how power and performance trade off in such a system will provide a clue about what will need to happen in future processors.

Tapping the full potential of Heterogeneous Architectures

• Getting the peak performance/power ratio is not a trivial task.

• To account for the differences between processing elements, it is usual to rely on the programmer to specify the mapping of the problem manually, which is:

    • labor intensive

    • not adaptable

Therefore the step that maps the computations to the processing elements must be as automated as possible.

Qilin:

Qilin is one such experimental system. It aims to fully automate the mapping from computations to processing elements using run-time adaptation.

It currently focuses on CPU+GPU platforms, but the adaptive mapping approach it follows is applicable to heterogeneous platforms in general.
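As a rough illustration of what adaptive mapping has to decide, the sketch below picks the fraction of the work to give to the CPU so that CPU and GPU finish at about the same time. It assumes a simple linear execution-time model per device; the `LinearModel` type and `cpu_fraction` function are hypothetical names for this example and are not part of the Qilin API.

```cpp
#include <algorithm>

// Hypothetical linear execution-time model for one device:
// time(n) = a + b * n, where n is the number of elements processed.
struct LinearModel {
    double a;  // fixed overhead in seconds
    double b;  // cost per element in seconds
    double time(double n) const { return a + b * n; }
};

// Returns the fraction beta of n elements to map to the CPU, chosen so that
// cpu.time(beta * n) == gpu.time((1 - beta) * n), clamped to [0, 1].
double cpu_fraction(const LinearModel& cpu, const LinearModel& gpu, double n) {
    double denom = (cpu.b + gpu.b) * n;
    if (denom <= 0.0) {
        return 0.5;  // degenerate model: fall back to an even split
    }
    double beta = (gpu.a - cpu.a + gpu.b * n) / denom;
    return std::max(0.0, std::min(1.0, beta));
}
```

For example, if the CPU costs three times as much per element as the GPU and neither has fixed overhead, the function assigns a quarter of the elements to the CPU. In a real system the model coefficients would come from measured training runs rather than being supplied by hand.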

What is Qilin?

• Qilin provides an API to programmers for writing parallelizable operations.

• The compiler is relieved of the difficult job of extracting parallelism from serial code.

• The compiler can focus on performance tuning.

• Beneath the API layer is the Qilin system layer, which consists of a dynamic compiler and its code cache, a number of libraries, a set of development tools, and a scheduler.

• The compiler dynamically translates the API calls into native machine code and decides the near-optimal mapping from computations to processing elements using an adaptive algorithm.

The Qilin Programming System

The Two Qilin Programming Styles

• Stream-API approach: the programmer uses only the Qilin stream API, which implements common data-parallel operations, including elementwise, reduction, and linear algebra operations, on Qilin-defined arrays called QArrays.

• Threading-API approach: the programmer provides the parallel implementations of computations in the threading APIs on which Qilin is built (i.e., TBB on the CPU side and CUDA on the GPU side for our current implementation).

My work was to write some of the benchmarks using the Threading-API approach and to measure the results of those benchmarks on different types of GPUs.

Writing Benchmarks for Qilin

To write a Qilin benchmark, you need to provide a CUDA version for GPUs and an OpenMP version for CPUs.

The main steps your benchmark must have, in main():

    Qilin_Init                // initialize Qilin
    Qilin_RegisterType        // register the data types you want to use
    PreKernel                 // read the command-line arguments
    MakeQArrayOp              // define the CPU and GPU operations for Qilin
    Create1D or Create2D      // create Qilin arrays (for each parameter type)
    MakeQArrayOpArg           // (for each previous arg)
    push_back                 // push these args to a list
    reference runs            // number of times you want to run the benchmark
    ApplyQArrayOp             // apply the data to the CPU and GPU functions
    ToNormalArray             // convert back to a normal array
    CompareReferenceResults
    PostKernel
    Qilin_Shutdown            // shut down Qilin

void CpuFilter(QArray<Pixel> qSrc, QArray<Pixel> qDst) {
    Pixel* src_cpu = qSrc.NormalArray();
    Pixel* dst_cpu = qDst.NormalArray();
    int height_cpu = qSrc.DimSize(0);
    int width_cpu = qSrc.DimSize(1);
    // Filter implementation in TBB
}

void GpuFilter(QArray<Pixel> qSrc, QArray<Pixel> qDst) {
    Pixel* src_gpu = qSrc.NormalArray();
    Pixel* dst_gpu = qDst.NormalArray();
    int height_gpu = qSrc.DimSize(0);
    int width_gpu = qSrc.DimSize(1);
    // Filter implementation in CUDA
}

void MyFilter(Pixel* src, Pixel* dst, int height, int width) {
    // Create Qilin arrays from normal arrays
    QArray<Pixel> qSrc = QArray<Pixel>::Create2D(height, width, src);
    QArray<Pixel> qDst = QArray<Pixel>::Create2D(height, width, dst);

    // Define myFilter as an operation that glues CpuFilter() and GpuFilter()
    QArrayOp myFilter = MakeQArrayOp("myFilter", CpuFilter, GpuFilter);

    // Build the argument list for myFilter. QILIN_PARTITIONABLE means the
    // associated computation can be partitioned to run on both CPU and GPU.
    QArrayOpArgsList argList;
    argList.Insert(qSrc, QILIN_PARTITIONABLE);
    argList.Insert(qDst, QILIN_PARTITIONABLE);

    // Apply myFilter with argList using the default mapping scheme
    QArray<BOOL> qSuccess = ApplyQArrayOp(myFilter, argList, PE_SELECTOR_DEFAULT);

    // Convert from qSuccess[] to success; this triggers the lazy evaluation
    BOOL success;
    qSuccess.ToNormalArray(&success, sizeof(BOOL));
}
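The slides do not show how a QILIN_PARTITIONABLE argument is actually divided between the two devices. As a minimal sketch, assuming a row-wise split of a 2D array according to a CPU fraction, the division could look like the following. `RowRange`, `Partition`, and `split_rows` are hypothetical names for this example and are not part of Qilin.

```cpp
#include <cstddef>

// Half-open range of rows [begin, end).
struct RowRange {
    std::size_t begin;
    std::size_t end;
};

// Rows assigned to each device.
struct Partition {
    RowRange cpu;
    RowRange gpu;
};

// Hypothetical helper: gives the first cpu_fraction of the rows to the CPU
// and the remaining rows to the GPU. The fraction is clamped to [0, 1] and
// the CPU row count is rounded to the nearest whole row.
Partition split_rows(std::size_t height, double cpu_fraction) {
    if (cpu_fraction < 0.0) cpu_fraction = 0.0;
    if (cpu_fraction > 1.0) cpu_fraction = 1.0;
    std::size_t cpu_rows =
        static_cast<std::size_t>(static_cast<double>(height) * cpu_fraction + 0.5);
    return Partition{RowRange{0, cpu_rows}, RowRange{cpu_rows, height}};
}
```

With this split, each device can run its own filter implementation over its own row range, and the two ranges together cover every row exactly once.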

Benchmarks for QILIN

• Binomial
• Convolve
• Gemv
• Matrix Multiplication
• Sepia
• Smithwat
• SVM
• Kmeans clustering
• LU Decomposition

Results

• Speedup: on average, adaptive mapping is 69% faster than always using the CPU and 33% faster than always using the GPU. This holds in the majority of cases.

• Qilin also shows good results with varying input sizes and adapts well to changes in hardware configuration, such as using different GPUs.
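For readers who want to reproduce figures like "69% faster" from raw timings, the sketch below shows one common way to compute them. It assumes "X% faster" means the ratio of execution times minus one; the slides do not define the term, so this interpretation and the function names are assumptions.

```cpp
// Speedup of a new configuration relative to a reference configuration,
// given the two execution times in the same unit.
double speedup(double t_ref, double t_new) {
    return t_ref / t_new;
}

// "Percent faster" under the assumed definition (speedup - 1) * 100.
double percent_faster(double t_ref, double t_new) {
    return (speedup(t_ref, t_new) - 1.0) * 100.0;
}
```

Under this definition, a run that takes half the reference time has a speedup of 2 and is 100% faster.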

We tested the benchmarks with different numbers of CPU threads and at various CPU frequencies.

A few results:

[Figures: results on a GeForce GTS 450 graphics card with 8 and with 12 CPU threads at a CPU frequency of 1596000 kHz. Panels: Qilin_Binomial, qilin_matmul, qilin_sepia.]

[Figures: results on a GTX 560 graphics card with 8 and with 12 CPU threads at a CPU frequency of 1596000 kHz. Panels: Qilin_Binomial, qilin_matmul.]

Visualizing performance/power trade-offs

• Finding a non-trivial relationship between power, energy, and time.

• Measuring the power used during the computation for various benchmarks with an external power meter.

Consider the "baseline" system to be 12 CPU cores (threads), each running at the maximum 2.4 GHz clock rate. All data is normalized to this baseline, which we might think of as "full power CPU."

The question we are interested in is as follows:

Does adding a GPU while decreasing CPU threads/clock give a speedup while reducing overall power?

GRAPHS

Each point represents some fraction of the work offloaded to the GPU (in 25% intervals) and some number of CPU threads.

We are looking for configurations in the upper-left quadrant of each subplot.

An interesting case would be a triangle (8 threads) appearing in the top row (lowest clock of 1.6 GHz) with GPU offloading (any color but red) and in the upper-left quadrant (speedup with power savings).
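The normalization and the quadrant test described above can be sketched as follows. The sketch assumes each subplot puts relative power on the horizontal axis and speedup on the vertical axis, so that "upper-left" means faster than the baseline at lower power. The type and function names are hypothetical.

```cpp
// One measured configuration: run time and average power draw.
struct Measurement {
    double seconds;
    double watts;
};

// The same configuration expressed relative to the baseline.
struct Normalized {
    double speedup;      // baseline time / measured time
    double power_ratio;  // measured power / baseline power
};

// Normalizes a measurement against the baseline ("full power CPU").
Normalized normalize(const Measurement& baseline, const Measurement& m) {
    return Normalized{baseline.seconds / m.seconds, m.watts / baseline.watts};
}

// True for points in the upper-left quadrant under the assumed axes:
// speedup greater than 1 and power below the baseline.
bool faster_and_lower_power(const Normalized& n) {
    return n.speedup > 1.0 && n.power_ratio < 1.0;
}
```

A configuration that halves the run time while drawing half the baseline power gives a speedup of 2 and a power ratio of 0.5, so it lands in the quadrant of interest.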

Points to note:

• With GPU offloading, some points show about a 2.3x speedup with more than a 10% reduction in power (at 2.40 GHz).

• There are points showing a 1.5x or greater speedup at the lowest clock rate (1.6 GHz) with less than a 1.4x increase in power.

However,

Another way to look at the power data:

Speedup vs Energyup
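Since energy is average power multiplied by time, an energy comparison follows directly from the power and timing data. The sketch below assumes "energyup" means the energy of a configuration relative to the baseline; the slides do not define the term, so that reading and the function names are assumptions.

```cpp
// Energy in joules from average power (watts) and run time (seconds).
double energy_joules(double avg_watts, double seconds) {
    return avg_watts * seconds;
}

// Energy of a configuration relative to the baseline. This equals the
// power ratio divided by the speedup. Values below 1 mean energy savings.
double energy_ratio(double base_watts, double base_seconds,
                    double watts, double seconds) {
    return energy_joules(watts, seconds) /
           energy_joules(base_watts, base_seconds);
}
```

For the point noted earlier, a speedup of about 2.3x at no more than 0.9x the baseline power corresponds to an energy ratio of at most roughly 0.9 / 2.3 ≈ 0.39.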

Heat maps

We plotted the heat maps from the data collected from the power meter.

• Heat maps for GTX 580.

• Heat maps for GeForce GTS 450.

• Heat maps for GTX 560.

FUTURE WORK

• Trying more interesting benchmarks.

• Need more efficient and accurate methods for power measurement.

• Testing for different hardware configurations.

• Extend Qilin's performance model to be able to predict the performance/power trade-offs observed in slides 17 and 18.

References:

• C.-K. Luk, S. Hong, and H. Kim. Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping. In MICRO-42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 45–55, New York, NY, USA, 2009. ACM.

• S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron. A Characterization of the Rodinia Benchmark Suite with Comparison to Contemporary CMP Workloads. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC '10), 2010.

• H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. Dark Silicon and the End of Multicore Scaling. In Proceedings of the 38th International Symposium on Computer Architecture (ISCA), San Jose, CA, USA, 2011.

• R. Nath, S. Tomov, and J. Dongarra. An Improved MAGMA GEMM for Fermi Graphics Processing Units. Int. J. High Perform. Comput. Appl. 24, 4 (November 2010), 511–515.

Questions

Thank You
