Code Optimisations and Performance Models for MATLAB
Code Optimisations and Performance Models for MATLAB
Patryk Kiepas1,2, Claude Tadonki1, Corinne Ancourt1
Jarosław Kozlak2
1MINES ParisTech/PSL University
2AGH University of Science and Technology, Poland
January 30, 2019
1 / 28
Outline
Motivation – Why MATLAB?
Three approaches to speedup MATLAB
Code transformations
  Loop coalescing
  Loop interchange
  Loop unrolling
  Strength reduction (power)
Problem with vectorization
MATLAB is JIT compiling
Building an optimisation heuristic
Conclusions
2 / 28
MATLAB is popular
Figure: TIOBE Index for December 2018. https://www.tiobe.com/tiobe-index/
3 / 28
Motivation
MATLAB
+ Dynamic language with simple and intuitive syntax
+ Great for fast prototyping
  I Built-ins: 2940 (R2018b)
  I MATLAB toolboxes: 66 (e.g. phased array, aerospace)
− Vendor lock-in, closed source
− Lack of formal semantics
− Performance is lagging behind other solutions
4 / 28
Performance comparison
[Chart: log-scale benchmark times for C, Julia, LuaJIT, Rust, Go, Fortran, Java, JavaScript, Matlab, Mathematica, Python, R and Octave on the micro-benchmarks iteration_pi_sum, matrix_multiply, matrix_statistics, parse_integers, print_to_file, recursion_fibonacci, recursion_quicksort and userfunc_mandelbrot.]
Figure: Julia Micro-Benchmarks. https://julialang.org/benchmarks/
5 / 28
Three approaches to speedup MATLAB
I Translation: MATLAB code → C, C++, Fortran
I New interpretation: MATLAB code → third-party interpreter
I Transformation: MATLAB code → MATLAB code (optimised)
6 / 28
Existing solutions
I New interpretation
  I Scilab (https://www.scilab.org/)
  I Octave (https://www.gnu.org/software/octave/)
  I MaJIC [Almasi and Padua, 2001]
  I McVM [Chevalier-Boisvert, 2009]
I Translation
  I MATLAB Coder (C) – official MathWorks compiler
  I SILKAN eVariX (C) (http://www.silkan.com/products/evarix/)
  I Menhir (C) [Chauveau and Bodin, 1999]
  I Mc2For (Fortran) [Chen et al., 2017]
  I FALCON (Fortran) [DeRose et al., 1995]
I Transformation
  I Mc2Mc [Chen et al., 2017] – performs vectorization
7 / 28
Loop coalescing
Before:
for k = 1:N
for l = 1:M
a(l, k) = a(l, k) + c;
end
end
After:
for T = 1:(N .* M)
a(T) = a(T) + c;
end
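The two versions can be checked for equivalence outside MATLAB. A minimal Python sketch (an illustration, not part of the original slides), assuming MATLAB's column-major layout where a(l, k) maps to linear index (k-1)*M + l:

```python
# Column-major layout assumption: a(l, k) lives at linear index
# (k-1)*M + l (0-based below), so coalescing the two loops into one
# loop over T visits exactly the same elements in the same order.
M, N, c = 3, 4, 2.0
a = [float(i) for i in range(M * N)]  # flattened M-by-N array

nested = a[:]
for k in range(N):          # MATLAB: for k = 1:N
    for l in range(M):      # MATLAB: for l = 1:M
        nested[k * M + l] += c

coalesced = a[:]
for T in range(N * M):      # MATLAB: for T = 1:(N .* M)
    coalesced[T] += c

assert nested == coalesced
```

The intuition behind the measured gains, as we read the slides, is that a single coalesced loop carries less interpreter/JIT loop overhead than the nested pair.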
[Chart: total cycles vs. iterations (0–900) for the original and coalesced loops on MATLAB R2013a, R2015b and R2018b.]
Experiment setup: Ubuntu 16.04.5 LTS, Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz, 16GB DDR4-2133MHz. Results with confidence intervals over 30 measurements with warmup phase consideration. Single-thread execution, measured with PAPI 5.6.
Example: Bacon, D. F., Graham, S. L., & Sharp, O. J. (1994). Compiler transformations for high-performance computing. ACM Computing Surveys, 26(4), 345–420.
8 / 28
Loop interchange
Before:
for k = 1:N
for l = 1:M
total(k) = total(k) + a(k, l);
end
end
After:
for l = 1:M
for k = 1:N
total(k) = total(k) + a(k, l);
end
end
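A quick Python check (illustrative, not from the slides) that the interchange is legal here: for each k the partial sums accumulate in the same l-order either way, so the results match exactly. The speedup in MATLAB comes, in our reading, from the interchanged inner loop walking the column-major array a(k, l) with unit stride.

```python
N, M = 3, 4
a = [[float(k * M + l) for l in range(M)] for k in range(N)]

total_orig = [0.0] * N
for k in range(N):          # original order: k outer, l inner
    for l in range(M):
        total_orig[k] += a[k][l]

total_swap = [0.0] * N
for l in range(M):          # interchanged order: l outer, k inner
    for k in range(N):
        total_swap[k] += a[k][l]

# For each k the additions happen in the same l-order in both
# versions, so even the floating-point results are bitwise equal.
assert total_orig == total_swap
```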
[Chart: total cycles vs. iterations (0–900) for the original and interchanged loops on MATLAB R2013a, R2015b and R2018b.]
Experiment setup: Ubuntu 16.04.5 LTS, Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz, 16GB DDR4-2133MHz. Results with confidence intervals over 30 measurements with warmup phase consideration. Single-thread execution, measured with PAPI 5.6.
Example: Bacon, D. F., Graham, S. L., & Sharp, O. J. (1994). Compiler transformations for high-performance computing. ACM Computing Surveys, 26(4), 345–420.
9 / 28
Loop unrolling
Before:
for k = 2:(N - 1)
a(k) = a(k) + a(k-1) .* a(k+1);
end
After:
for k = 2:2:(N - 2)
a(k) = a(k) + a(k-1) .* a(k+1);
a(k+1) = a(k+1) + a(k) .* a(k+2);
end
if mod((N-2), 2) == 1
a(N-1) = a(N-1) + a(N-2) .* a(N);
end
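The unrolled version must preserve the loop-carried dependence (a(k+1) reads the freshly updated a(k)) and handle the odd-length remainder. A Python sketch of the same transformation (0-based, illustrative only):

```python
def original(a):
    a = a[:]
    N = len(a)
    for k in range(1, N - 1):              # MATLAB: for k = 2:(N - 1)
        a[k] += a[k - 1] * a[k + 1]        # reads the updated a[k-1]
    return a

def unrolled_by_2(a):
    a = a[:]
    N = len(a)
    k = 1
    while k + 1 <= N - 2:                  # two updates per trip
        a[k] += a[k - 1] * a[k + 1]
        a[k + 1] += a[k] * a[k + 2]        # sees the update just above
        k += 2
    if k == N - 2:                         # epilogue for an odd trip count
        a[k] += a[k - 1] * a[k + 1]
    return a

for n in (7, 8):                           # odd and even trip counts
    data = [float(i + 1) for i in range(n)]
    assert original(data) == unrolled_by_2(data)
```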
[Chart: total cycles vs. iterations (0–200000) for the original and unrolled loops on MATLAB R2013a, R2015b and R2018b.]
Experiment setup: Ubuntu 16.04.5 LTS, Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz, 16GB DDR4-2133MHz. Results with confidence intervals over 30 measurements with warmup phase consideration. Single-thread execution, measured with PAPI 5.6.
Example: Bacon, D. F., Graham, S. L., & Sharp, O. J. (1994). Compiler transformations for high-performance computing. ACM Computing Surveys, 26(4), 345–420.
10 / 28
Strength reduction (power)
Before:
for k = 1:N
a(k) = a(k) + c.^k;
end
After:
T = c;
for k = 1:N
a(k) = a(k) + T;
T = T .* c;
end
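The rewrite replaces a power computation per iteration with a single multiply by keeping the running value T = c^k. An illustrative Python check (a tolerance is used because the running product and the library power can differ in the last bits):

```python
N, c = 12, 1.5
a = [float(i) for i in range(N)]

# Before: a(k) + c^k with 1-based k, computed with an explicit power
before = [a[k] + c ** (k + 1) for k in range(N)]

# After: strength-reduced, T carries c^k across iterations
after = a[:]
T = c
for k in range(N):
    after[k] += T
    T *= c

assert all(abs(x - y) < 1e-9 for x, y in zip(before, after))
```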
[Chart: total cycles vs. iterations (0–200000) for the original and simplified loops on MATLAB R2013a, R2015b and R2018b.]
Experiment setup: Ubuntu 16.04.5 LTS, Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz, 16GB DDR4-2133MHz. Results with confidence intervals over 30 measurements with warmup phase consideration. Single-thread execution, measured with PAPI 5.6.
Example: Bacon, D. F., Graham, S. L., & Sharp, O. J. (1994). Compiler transformations for high-performance computing. ACM Computing Surveys, 26(4), 345–420.
11 / 28
Vectorization in MATLAB
% scalar form
for i = 1:N
c(i)=a(i)*b(i)
end
% vector form
c(1:N)=a(1:N).*b(1:N)
% after simplification
c=a.*b
I For many years vectorization was a prevalent optimisation, usually applied systematically
+ Performing more floating-point operations simultaneously
− Sometimes decreases performance in comparison to JIT-compiled loops (Chen et al. 2017 and Kiepas et al. 2018)
12 / 28
Reproduction of [Chen et al., 2017]
I Benchmarks from the Ostrich suite
I Vectorized with Mc2Mc
I Executed on MATLAB R2015b
Benchmark  Dwarf                  Chen et al.  Us
backprop   unstructured grid      0.71         0.81
bs         –                      15.0         8.33
capr       dense linear algebra   0.79         0.85
crni       structured grid        0.83         0.81
fft        spectral method        0.59         0.64
nw         dynamic programming    0.96         1.00
pagerank   Monte Carlo/MapReduce  0.94         0.94
mc         Monte Carlo/MapReduce  2.02         2.22
spmv       sparse linear algebra  0.013        0.02
Table: Kiepas, P., Kozlak, J., Tadonki, C., & Ancourt, C. (2018). Profile-based vectorization for MATLAB. ARRAY 2018 (pp. 18–23).
https://github.com/Sable/Ostrich2
13 / 28
Is vectorization still relevant?
[Chart: speedup (0.4–1.4) of the vectorized crni1 and backprop1 loops vs. iterations (data size, 0–2000), relative to the loop baseline at 1.0.]
Figure: Kiepas, P., Kozlak, J., Tadonki, C., & Ancourt, C. (2018). Profile-based vectorization for MATLAB. ARRAY 2018 (pp. 18–23).
14 / 28
Improving Mc2Mc code generation
Range inlining
% From
k = 1:N;
B = A(k) + 2;
% To
B = A(1:N) + 2;
Range conversion
% From
B = A(2*(1:N) -1);
% To
B = A(1:2:(2*N-1));
Removing explicit index-all
% From
B(:) = A(1:end);
% To
B = A;
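Each rewrite must be semantics-preserving. The following Python sketch (illustrative, a 0-based translation of the 1-based MATLAB ranges) checks that each "From"/"To" pair selects the same elements:

```python
A = list(range(20))
N = 5

# Range inlining: materialised index vector vs. the inlined range
k = list(range(N))                          # MATLAB: k = 1:N
assert [A[i] + 2 for i in k] == [A[i] + 2 for i in range(N)]

# Range conversion: A(2*(1:N)-1) vs. the strided range A(1:2:(2*N-1)).
# 0-based these are indices 0, 2, ..., 2N-2, i.e. the slice A[0:2N-1:2].
assert [A[2 * j] for j in range(N)] == A[0:2 * N - 1:2]

# Removing explicit index-all: B(:) = A(1:end) is just a copy, B = A
B = A[:]
assert B == A
```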
15 / 28
Profitable vectorization point (PV)
Loop       Benchmark iterations  PV iterations  Improved PV iterations
backprop1  {17, 2850001}         ∅              ≥ 255
backprop2  2                     ≥ 4033         ≥ 257
backprop3  {17, 2850001}         ∅              ≥ 385
backprop4  2                     ∅              ≥ 257
capr1      8                     ≥ 20           ≥ 17
capr2      20                    ≥ 3329         ≥ 385
capr3      49                    ≥ 5953         ≥ 321
crni1      2300                  ≥ 161          ≥ 193
crni2      2300                  ∅              ≥ 289
crni3      2300                  ∅              ≥ 1217
fft1       256                   ∅              ≥ 417
fft2       2, 4, 8 ... 256       ∅              ≥ 129
nw1        4097                  ∅              ≥ 65
nw2        4097                  ≥ 1665         ≥ 257
nw3        4097                  ≥ 7681         ≥ 193
pagerank1  1000                  ∅              ≥ 273
spmv1      {2, 3}                ≥ 6337         ≥ 321
Table: Kiepas, P., Kozlak, J., Tadonki, C., & Ancourt, C. (2018). Profile-based vectorization for MATLAB. ARRAY 2018 (pp. 18–23).
16 / 28
Profile-guided vectorization
[Chart: speedup (0–2) over the loop baseline for the benchmarks backprop, crni, fft, nw and pagerank under two strategies: systematic vectorization vs. selective (optimized) vectorization.]
Figure: Kiepas, P., Kozlak, J., Tadonki, C., & Ancourt, C. (2018). Profile-based vectorization for MATLAB. ARRAY 2018 (pp. 18–23).
17 / 28
A bit of history of MATLAB
I Starts as an interpreter (1984)
I Introduces a JIT alongside the interpreter around version 6.5 (2002)
I Combines JIT with the interpreter in R2015b
I Introduces PGO (profile-guided optimisation) around R2018b
18 / 28
Warmup phase
Warmup is an observable effect of some JIT policy performing compilation on the code. A policy is a set of rules determining if, when and how to compile the code [Kulkarni 2011].
[Kulkarni 2011]: Kulkarni, P. A. (2011). JIT compilation policy for modern machines. ACM SIGPLAN Notices, 46(10), 773.
19 / 28
Warmup phase patterns
[Charts: time [s] per in-process iteration (0–300) for three runs: backprop, R2018b, process #1 (warmup); nqueens, R2015b, process #8 (warmup); bubble, R2013a, process #1 (slowdown).]
The patterns come in different flavours [Barrett et al. 2017]:
I Warmup
I Slowdown
I Flat
I Inconsistent
[Barrett et al. 2017]: Barrett, E., Bolz-Tereick, C. F., Killick, R., Mount, S., & Tratt, L. (2017). Virtual machine warmup blows hot and cold. Proceedings of the ACM on Programming Languages, vol. 1 (Issue OOPSLA), 1–27.
20 / 28
About our heuristic
Our heuristic is a binary choice (optimise – positive / do nothing – negative) that takes into consideration the code, the trip count and/or the machine's properties.
Design goal
Prefer being conservative (false negatives, FN, are acceptable) over optimising wrongly (false positives, FP):
precision = TP / (TP + FP) → 1 (1)
However, too many false negatives means we optimise only rarely!
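In these terms, precision counts how often a triggered optimisation was actually profitable, while the trade-off against missed opportunities is the usual recall (our framing of the slide's remark about FN, not terminology from the slides):

```python
def precision(tp, fp):
    # fraction of positive (optimise) decisions that were correct
    return tp / (tp + fp)

def recall(tp, fn):
    # fraction of profitable cases the heuristic actually caught;
    # many false negatives drive this down even at perfect precision
    return tp / (tp + fn)

# Hypothetical counts for a conservative heuristic:
# few wrong optimisations, but many missed ones.
assert precision(tp=96, fp=4) == 0.96
assert recall(tp=96, fn=104) == 0.48
```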
21 / 28
1. Handcrafted optimisation heuristics
We pose a question: What does vectorization change?
[Charts: ratio of change after vectorization (1–5) vs. iterations (0–400) for four performance counters: store instructions (PAPI_SR_INS), cycles with no instruction finished (PAPI_STL_CCY), conditional branches (PAPI_BR_CN) and load instructions (PAPI_LD_INS); loop TSVC/s1115, MATLAB R2013a.]
Experiment setup: Ubuntu 16.04.5 LTS, Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz, 16GB DDR4-2133MHz. Results from 30 measurements with warmup phase consideration. Single-thread execution, measured with PAPI 5.6.
22 / 28
Precision
[Chart: precision (60–100%) vs. threshold (0.0–2.0) on the ratio of change for loads, stores, branches and stalls.]
Precision of handcrafted heuristics; TSVC Benchmark Suite; R2013a
23 / 28
2. Automatic dynamic model
Following the work of [Cavazos et al., 2007], we have built a model using machine learning and a dynamic set of features (performance counters).
Methodology
1. Collecting performance counters (TSVC Benchmark Suite)
2. Normalising (by PAPI_TOT_INS, hybrid)
3. Oversampling for dealing with class imbalance
4. Training on TSVC, testing on LCPC16 [Chen et al., 2017]
5. Only out-of-the-box components, no fine-tuning (meta-learning, hyperparameter optimisation)
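Step 2 can be sketched as follows; an assumption on our part is that "normalising by PAPI_TOT_INS" means dividing each raw counter by the total instruction count, so that features become per-instruction rates comparable across loops of different sizes (the counter values below are hypothetical):

```python
raw = {                        # hypothetical raw counter readings
    "PAPI_LD_INS": 3_000_000,
    "PAPI_SR_INS": 1_000_000,
    "PAPI_BR_CN":    500_000,
    "PAPI_TOT_INS": 10_000_000,
}

total = raw["PAPI_TOT_INS"]
features = {name: value / total
            for name, value in raw.items() if name != "PAPI_TOT_INS"}

assert features["PAPI_LD_INS"] == 0.3   # 30% of instructions are loads
assert features["PAPI_SR_INS"] == 0.1
```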
[Cavazos et al. 2007]: Cavazos, J., Fursin, G., Agakov, F., Bonilla, E., O'Boyle, M. F. P., & Temam, O. (2007). Rapidly Selecting Good Compiler Optimizations using Performance Counters. CGO'07 (pp. 185–197).
24 / 28
Evaluation
Test                             Metric         AdaBoost  Decision Tree (CART)
TSVC (10-fold cross-validation)  Precision (%)  96.63     97.02
                                 Accuracy (%)   94.38     93.95
LCPC16 test set                  Precision (%)  99.51     99.36
                                 Accuracy (%)   92.85     72.26
25 / 28
Decision tree
[Figure: the trained CART decision tree (1652 training samples). The root splits on the trip count (N ≤ 361); deeper splits test normalised counters such as PAPI_L2_ICM, PAPI_BR_MSP, PAPI_STL_ICY, PAPI_STL_CCY, PAPI_BR_CN, PAPI_BR_UCN, PAPI_L1_STM, PAPI_L2_LDM, PAPI_L2_DCM, PAPI_TLB_DM, PAPI_PRF_DM, PAPI_LD_INS, PAPI_SR_INS, PAPI_FUL_ICY, PAPI_MEM_WCY, PAPI_RES_STL and FP_ARITH:SCALAR_DOUBLE, with leaves labelled VECTORIZE or NOTHING.]
26 / 28
3. Automatic static model
Image: Cummins, C., Petoumenos, P., Wang, Z., and Leather, H. (2017). End-to-End Deep Learning of Optimization Heuristics. In 2017 26th IEEE International Conference on Parallel Architectures and Compilation Techniques (PACT). [Cummins et al., 2017]
I Sequences of codes are the input
I Auxiliary inputs: number of iterations
I No dynamic features
I In order to force learning from sequences – shorten sequences (less padding)
I Small precision – more data? Around 1652 data points, but only 118 code sequences.
27 / 28
Conclusions
I Working optimisation heuristics without opening MATLAB's black box (which might be infeasible)
I Deeper understanding of how to measure MATLAB's performance
I Perspective: fine-tuning of models and extending the evaluation to other machines and versions of MATLAB
Thank you!
28 / 28
Almasi, G. and Padua, D. (2001). MaJIC: A Matlab just-in-time Compiler. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 2017, pages 68–81.
Cavazos, J., Fursin, G., Agakov, F., Bonilla, E., O'Boyle, M. F. P., and Temam, O. (2007). Rapidly Selecting Good Compiler Optimizations using Performance Counters. In International Symposium on Code Generation and Optimization (CGO'07), pages 185–197. IEEE.
Chauveau, S. and Bodin, F. (1999). Menhir: An Environment for High Performance Matlab. Scientific Programming, 7(3-4):303–312.
Chen, H., Krolik, A., Lavoie, E., and Hendren, L. (2017). Automatic Vectorization for MATLAB. In Ding, C., Criswell, J., and Wu, P., editors, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 10136 LNCS of Lecture Notes in Computer Science, pages 171–187. Springer International Publishing, Cham.
Chevalier-Boisvert, M. (2009). MCVM: An Optimizing Virtual Machine for The MATLAB Programming Language.
Cummins, C., Petoumenos, P., Wang, Z., and Leather, H. (2017). End-to-End Deep Learning of Optimization Heuristics. In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 219–232. IEEE.
DeRose, L., Gallivan, K., Gallopoulos, E., Marsolf, B. A., and Padua, D. (1995). FALCON: An Environment for the Development of Scientific Libraries and Applications. Proc. First International Workshop on Knowledge-Based System for the (re)Use of Program Libraries, (November).