A STUDY OF PERFORMANCE AND POWER TRADE-OFFS ON HETEROGENEOUS ARCHITECTURES USING THE QILIN PROGRAMMING MODEL Submitted by: Gurbinder Singh Gill CSE, B.Tech.Year 4th IIT Roorkee, India GUIDES: DR. RICHARD VUDUC DR. HYESOON KIM Georgia Institute of Technology, Georgia, USA CRUISE INTERNSHIP 2011

A study of performance and power trade-offs on heterogeneous architectures using the Qilin programming model

Jun 26, 2015

Gurbinder Gill

This is related to the research work done at HPC Lab, GaTech in summer 2011 which was aimed at finding
Page 1: A study of performance and power trade-offs on heterogeneous architectures using the Qilin programming model

A STUDY OF

PERFORMANCE AND POWER TRADE-OFFS ON

HETEROGENEOUS ARCHITECTURES USING

THE QILIN PROGRAMMING MODEL

Submitted by: Gurbinder Singh Gill CSE, B.Tech.Year 4th IIT Roorkee, India

GUIDES: DR. RICHARD VUDUC

DR. HYESOON KIM

Georgia Institute of Technology, Georgia, USA

CRUISE INTERNSHIP 2011

Page 2

The Problem:

How do changes in power affect performance?

• Looking for a relationship between power and performance

One of the “Dark Silicon” papers proposed that:

Future processors are likely to be power-limited, meaning there will not be enough total power to turn on the entire chip.

Why?

• Moore's Law

Page 3

Under Moore's Law the number of transistors on a chip keeps growing, which in turn increases the power required by the chip.

• Power increases with both the number of transistors and the supply voltage.
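The usual first-order model for the dynamic (switching) power of CMOS logic makes this dependence explicit. The symbols below are the standard textbook ones and do not come from the slides:

```latex
P_{\text{dyn}} \approx \alpha \, C \, V_{dd}^{2} \, f
```

Here \alpha is the activity factor, C is the switched capacitance (which grows with the number of transistors), V_{dd} is the supply voltage, and f is the clock frequency. Because the voltage enters quadratically, lowering voltage together with frequency is one of the main levers for trading performance against power.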

How can heterogeneous chips help?

Why?

Page 4

Today, one widespread example of a mix of processing components is the use of both CPUs and GPUs in a system. Therefore, understanding how power and performance trade off in such a system will provide a clue about what will need to happen in future processors.

Page 5

Tapping the full potential of Heterogeneous Architectures

• Getting the peak performance/power ratio is not a trivial task.

• To account for the differences between processing elements, it is usual to rely on the programmer to specify the mapping of the problem manually,

  • which makes it labor intensive

  • and non-adaptable.

Therefore the step that maps the computations to the processing elements must be as automated as possible.

Page 6

What is QILIN?

Qilin is one such experimental system: it aims to fully automate the mapping from computations to processing elements using run-time adaptation.

It currently focuses on CPU+GPU platforms, but the adaptive mapping approach it follows is applicable to heterogeneous platforms in general.

Page 7

The Qilin Programming System

• Qilin provides an API to programmers for writing parallelizable operations.

• The compiler is relieved of the difficult job of extracting parallelism from serial code.

• The compiler can focus on performance tuning.

• Beneath the API layer is the Qilin system layer, which consists of a dynamic compiler and its code cache, a number of libraries, a set of development tools, and a scheduler.

• The compiler dynamically translates the API calls into native machine code and decides the near-optimal mapping from computations to processing elements using an adaptive algorithm.
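A minimal sketch of how such an adaptive mapping decision could be made, loosely following the idea in the Qilin paper listed in the references: model each device's execution time as a linear function of input size (fitted from training runs), then choose the CPU share that minimizes the predicted time of the slower device. The names LinearModel, PredictedTime and ChooseCpuFraction, and any coefficients passed in, are illustrative assumptions and not part of Qilin's API.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical linear execution-time model: time(n) = a + b * n.
// In a real system the coefficients would be fitted from training runs.
struct LinearModel {
  double a;  // fixed overhead in seconds
  double b;  // cost per element in seconds
  // A device that receives no work contributes no time.
  double Time(double n) const { return n > 0.0 ? a + b * n : 0.0; }
};

// Predicted time when a fraction beta of the n elements runs on the CPU and
// the rest on the GPU, with both devices working concurrently.
double PredictedTime(const LinearModel& cpu, const LinearModel& gpu,
                     double n, double beta) {
  return std::max(cpu.Time(beta * n), gpu.Time((1.0 - beta) * n));
}

// Scans beta over [0, 1] in `steps` equal increments and returns the value
// with the smallest predicted time (the first such value on ties).
double ChooseCpuFraction(const LinearModel& cpu, const LinearModel& gpu,
                         double n, int steps = 100) {
  assert(steps > 0);
  double best_beta = 0.0;
  double best_time = PredictedTime(cpu, gpu, n, 0.0);
  for (int i = 1; i <= steps; ++i) {
    const double beta = static_cast<double>(i) / steps;
    const double t = PredictedTime(cpu, gpu, n, beta);
    if (t < best_time) {
      best_time = t;
      best_beta = beta;
    }
  }
  return best_beta;
}
```

For example, with a CPU that costs three times as much per element as the GPU and no fixed overheads, ChooseCpuFraction returns a CPU share of about one quarter. A grid scan is used here for clarity; solving for the intersection of the two lines would give the same balance point directly.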

Page 8

The Two Qilin Programming Styles

• Stream-API approach: the programmer uses only the Qilin stream API, which implements common data-parallel operations, including elementwise, reduction, and linear algebra operations, on Qilin-defined arrays called QArrays.

• Threading-API approach: the programmer provides parallel implementations of the computations in the threading APIs on which Qilin is built (i.e., TBB on the CPU side and CUDA on the GPU side for our current implementation).

My work was to write some of the benchmarks using the Threading-API approach and measure their results on different types of GPUs.

Page 9

Writing Benchmarks for QILIN

To write a Qilin benchmark you need to provide a CUDA version for GPUs and an OpenMP version for CPUs.

The following are the main steps your benchmark must have:

main():
  Qilin_Init                 // initialize Qilin
  Qilin_RegisterType         // type of registers you want to use
  PreKernel                  // read the command-line arguments
  MakeQArrayOp               // define CPU and GPU operations for Qilin
  Create1D or Create2D       // create Qilin arrays (for each parameter type)
  MakeQArrayOpArg            // for each previous arg
  push_back                  // push these args to a list
  reference runs             // number of times you want to run the benchmarks
  ApplyQArrayOp              // apply the data to the CPU and GPU functions
  ToNormalArray              // convert back to a normal array
  CompareReferenceResults
  PostKernel
  Qilin_Shutdown             // shut down Qilin
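The slides do not show what the CompareReferenceResults step does internally. As a sketch of what such a check could look like, the hypothetical helper below compares a benchmark's output against a reference run using a tolerance, since CPU and GPU floating-point results commonly differ in the last few bits. MatchesReference and its default tolerance are assumptions made for illustration and are not Qilin functions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns true if every element of `result` is within a
// mixed absolute/relative tolerance of the matching element of `reference`.
bool MatchesReference(const std::vector<float>& result,
                      const std::vector<float>& reference,
                      float tolerance = 1e-4f) {
  if (result.size() != reference.size()) return false;
  for (std::size_t i = 0; i < result.size(); ++i) {
    // Scale the tolerance by the reference magnitude, but never below 1.
    const float scale = std::max(1.0f, std::fabs(reference[i]));
    const float diff = std::fabs(result[i] - reference[i]);
    // Written as !(diff <= limit) so that a NaN result fails the check.
    if (!(diff <= tolerance * scale)) return false;
  }
  return true;
}
```

A benchmark would call this once after converting the Qilin result back to a normal array, passing the output of a plain single-device run as the reference.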

Page 10

void CpuFilter(QArray<Pixel> qSrc, QArray<Pixel> qDst) {
  Pixel* src_cpu = qSrc.NormalArray();
  Pixel* dst_cpu = qDst.NormalArray();
  int height_cpu = qSrc.DimSize(0);
  int width_cpu = qSrc.DimSize(1);
  // Filter implementation in TBB
}

void GpuFilter(QArray<Pixel> qSrc, QArray<Pixel> qDst) {
  Pixel* src_gpu = qSrc.NormalArray();
  Pixel* dst_gpu = qDst.NormalArray();
  int height_gpu = qSrc.DimSize(0);
  int width_gpu = qSrc.DimSize(1);
  // Filter implementation in CUDA
}

void MyFilter(Pixel* src, Pixel* dst, int height, int width) {
  // Create Qilin arrays from normal arrays
  QArray<Pixel> qSrc = QArray<Pixel>::Create2D(height, width, src);
  QArray<Pixel> qDst = QArray<Pixel>::Create2D(height, width, dst);

  // Define myFilter as an operation that glues CpuFilter() and GpuFilter()
  QArrayOp myFilter = MakeQArrayOp("myFilter", CpuFilter, GpuFilter);

  // Build the argument list for myFilter. QILIN_PARTITIONABLE means the
  // associated computation can be partitioned to run on both CPU and GPU.
  QArrayOpArgsList argList;
  argList.Insert(qSrc, QILIN_PARTITIONABLE);
  argList.Insert(qDst, QILIN_PARTITIONABLE);

  // Apply myFilter with argList using the default mapping scheme
  QArray<BOOL> qSuccess = ApplyQArrayOp(myFilter, argList, PE_SELECTOR_DEFAULT);

  // Convert from qSuccess[] to success, and this triggers the lazy evaluation
  BOOL success;
  qSuccess.ToNormalArray(&success, sizeof(BOOL));
}

Page 11

Benchmarks for QILIN

• Binomial
• Convolve
• Gemv
• Matrix Multiplication
• Sepia
• Smithwat
• SVM
• Kmeans clustering
• LU Decomposition

Page 12

Results

• Speedup:

  • On average, adaptive mapping is 69% faster than always using the CPU and 33% faster than always using the GPU. This holds in the majority of cases.

  • Qilin also gives good results across varying input sizes and adapts well to changes in hardware configuration, such as different GPUs.

We tested the benchmarks with different numbers of CPU threads and at various CPU frequencies.

Page 13

A few results:

[Charts omitted. Setup 1: GeForce GTS 450 graphics card, 8 CPU threads, CPU frequency 1596000 (apparently in kHz, i.e. about 1.6 GHz). Setup 2: GeForce GTS 450 graphics card, 12 CPU threads, same CPU frequency. Panel titles: Qilin_Binomial, qilin_matmul, qilin_sepia.]

Page 14

[Charts omitted. Setup 1: GTX 560 graphics card, 8 CPU threads, CPU frequency 1596000 (apparently in kHz, i.e. about 1.6 GHz). Setup 2: GTX 560 graphics card, 12 CPU threads, same CPU frequency. Panel titles as extracted: Qilin_Binomial, qilin_matmul (listed twice).]

Page 15

Visualizing performance/power trade-offs

• Finding the non-trivial relationship between power, energy and time.

• Measuring the power used during the computation for various benchmarks with an external power meter.
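Power, energy and time are tied together by the fact that energy is power integrated over time. As a sketch, assuming the power meter yields timestamped readings in watts (the PowerSample layout and both function names are hypothetical, since the slides do not describe the meter's output format), the energy and average power of one run could be computed like this:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One reading from the power meter (hypothetical record layout).
struct PowerSample {
  double time_s;   // timestamp in seconds
  double power_w;  // instantaneous power in watts
};

// Energy in joules, using trapezoidal integration of power over time.
double EnergyJoules(const std::vector<PowerSample>& samples) {
  double energy = 0.0;
  for (std::size_t i = 1; i < samples.size(); ++i) {
    const double dt = samples[i].time_s - samples[i - 1].time_s;
    assert(dt >= 0.0);  // samples must be in time order
    energy += 0.5 * (samples[i].power_w + samples[i - 1].power_w) * dt;
  }
  return energy;
}

// Average power in watts over the sampled interval.
double AveragePowerWatts(const std::vector<PowerSample>& samples) {
  if (samples.empty()) return 0.0;
  const double duration = samples.back().time_s - samples.front().time_s;
  if (duration <= 0.0) return samples.front().power_w;
  return EnergyJoules(samples) / duration;
}
```

The trapezoidal rule is only as accurate as the meter's sampling rate allows, so short benchmark runs may need to be repeated to get enough samples.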

Page 16

Consider the "baseline" system to be 12 CPU cores (threads), each running at the maximum 2.4 GHz clock rate. All data is normalized to this baseline, which we might think of as "full power CPU."

The question we are interested in is as follows:

Does adding a GPU while decreasing CPU threads/clock give a speedup while reducing overall power?

GRAPHS

Page 17

Each point represents some fraction offloaded to the CPU (in 25% intervals) and some number of threads.

Page 18

We are looking for configurations in the upper-left quadrant of each subplot, that is, configurations that give a speedup with power savings.
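The normalization and the quadrant test can be sketched as follows, assuming each subplot has relative power on the horizontal axis and speedup on the vertical axis, so that "upper-left" means a speedup above 1 at a power ratio below 1. The Measurement struct and the function names are illustrative assumptions and are not taken from the study's scripts.

```cpp
#include <cassert>

// Hypothetical record for one measured configuration.
struct Measurement {
  double time_s;       // execution time in seconds
  double avg_power_w;  // average power in watts
};

// Speedup relative to the baseline: values above 1 mean faster.
double Speedup(const Measurement& baseline, const Measurement& config) {
  assert(config.time_s > 0.0);
  return baseline.time_s / config.time_s;
}

// Power relative to the baseline: values below 1 mean less power.
double PowerRatio(const Measurement& baseline, const Measurement& config) {
  assert(baseline.avg_power_w > 0.0);
  return config.avg_power_w / baseline.avg_power_w;
}

// True when the configuration is both faster and lower-power than the
// baseline, i.e. it falls in the upper-left quadrant under the assumed axes.
bool InUpperLeftQuadrant(const Measurement& baseline,
                         const Measurement& config) {
  return Speedup(baseline, config) > 1.0 &&
         PowerRatio(baseline, config) < 1.0;
}
```

The baseline passed in would be the measurement for the 12-thread, 2.4 GHz CPU-only configuration, so the baseline itself sits at the point (1, 1).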

Page 19

An interesting case would be a triangle (8 threads) appearing in the top row (lowest clock of 1.6 GHz) with GPU offloading (any color but red) and in the upper-left quadrant (speedup with power savings).

Points to note:

• With GPU offloading, there are points showing about 2.3x speedup with more than a 10% reduction in power (at 2.40 GHz).

• There are points showing 1.5x or more speedup at the lowest clock rate (1.6 GHz) with less than a 1.4x increase in power.

However,

Page 20

Another way to look at the power data:

Page 21
Page 22

Speedup vs Energyup
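The slides do not define "Energyup". Assuming it denotes energy relative to the baseline, in the same way that speedup denotes baseline time over measured time, and writing P for average power and T for execution time, the two quantities are linked by:

```latex
\text{Energyup}
  = \frac{E}{E_{\text{base}}}
  = \frac{P \, T}{P_{\text{base}} \, T_{\text{base}}}
  = \frac{P / P_{\text{base}}}{T_{\text{base}} / T}
  = \frac{\text{relative power}}{\text{speedup}}
```

Under that assumption, a point with a 2.3x speedup and a 10% power reduction would use about 0.9 / 2.3 ≈ 0.39 of the baseline energy, and a point with a 1.5x speedup at 1.4x the power would use about 1.4 / 1.5 ≈ 0.93 of it, so even a configuration that draws more power can still save energy if it finishes sooner.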

Page 23
Page 24
Page 25

Heat maps

We plotted heat maps from the data collected with the power meter.

• Heat maps for GTX 580.

• Heat maps for GeForce GTS 450.

• Heat maps for GTX 560.

Page 26

FUTURE WORK

• Trying more interesting benchmarks.

• Finding more efficient and accurate methods for power measurement.

• Testing on different hardware configurations.

• Extending Qilin's performance model so that it can predict the performance/power trade-offs observed in slides 17 and 18.

Page 27

References:

• C.-K. Luk, S. Hong, and H. Kim. 2009. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In MICRO-42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 45–55. New York, NY, USA. ACM.

• S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron. 2010. A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC '10).

• H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th International Symposium on Computer Architecture (ISCA). San Jose, CA, USA.

• R. Nath, S. Tomov, and J. Dongarra. 2010. An improved MAGMA GEMM for Fermi graphics processing units. Int. J. High Perform. Comput. Appl. 24, 4 (November 2010), 511–515.

Page 28

Questions

Page 29

Thank You