Page 1

Intel Xeon Phi Workshop

Bart Oldeman, McGill HPC
bart.oldeman@mcgill.ca

guillimin@calculquebec.ca
May 7, 2015

Page 2

Online Slides

● http://tinyurl.com/xeon-phi-handout-may2015
– OR

● http://www.hpc.mcgill.ca/downloads/xeon_phi_workshop_may2015/handout.pdf

Page 3

Outline

● Login and Setup
● Overview of Xeon Phi
● Interacting with Xeon Phi

– Linux shell

– Ways to program the Xeon Phi

● Native Programming
● Offload Programming
● Choosing how to run your code

Page 4

Exercise 0: Login and Setup

● Please use class accounts to access reserved resources

● $ ssh class##@guillimin.hpc.mcgill.ca

– enter password
● [class##@lg-1r17-n01]$ module add ifort_icc/15.0

● ifort_icc module: In Intel's documentation you will see instructions to run 'source compilervars.sh'
– On guillimin, this script is replaced by this module

Page 5

Exercise 0: Workshop Files

● Please copy the workshop files to your home directory
– $ cp -R /software/workshop/phi/* ~/.

● Contains:
– Code for the exercises

– An example submission script

– Solutions to the exercises

Page 6

Exercise 1: Interactive Session

● For the workshop, we want interactive access to the Phis
● [class01@lg-1r17-n01]$ qsub -I -l nodes=1:ppn=16:mics=2 -l walltime=8:00:00
● [class01@aw-4r13-n01]$ module add ifort_icc/15.0

Page 7

Exercise 2: OpenMP on CPU

● Compile the OpenMP program axpy_omp.c (a sketch of such a kernel follows below)
– $ icc -openmp -o axpy axpy_omp.c

● Run this program on the aw node
– $ ./axpy

● This program ran on the host
● Use the OMP_NUM_THREADS environment variable to control the number of parallel threads
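The workshop's axpy_omp.c is not reproduced in this handout; as a rough sketch of what such a kernel looks like (the vector length, values, and timing here are illustrative assumptions, not the workshop code):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1 << 24)   /* illustrative vector length */

int main(void) {
    float a = 2.0f;
    float *x = malloc(N * sizeof *x);
    float *y = malloc(N * sizeof *y);
    double t0, t1;
    int i;

    for (i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    t0 = omp_get_wtime();
    /* the iterations are divided among OMP_NUM_THREADS threads */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
    t1 = omp_get_wtime();

    printf("threads: %d, seconds: %f\n", omp_get_max_threads(), t1 - t0);
    free(x);
    free(y);
    return 0;
}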

Page 8

What is a Xeon Phi?

Page 9

What is a Xeon Phi?

● A device for handling computationally expensive hot spots in your code ('co-processor' or 'accelerator')

● A large number of low-powered, low-cost (in computational overhead, power, size, and monetary cost) processors (modified Pentium cores)

● “Supercomputer on a chip”: Teraflops through massive parallelism (dozens or 100s of parallel threads)

● Heterogeneous computing: Host and Phi can work together on the problem

Page 10

What was the ASCI Red?

● 1997: the first teraflop supercomputer, with the same compute power as a single Xeon Phi
● 4,510 nodes (9,298 processors), 1,212 GB of RAM in total, 12.5 TB of disk storage
● 850 kW vs. 225 W for a Xeon Phi

Page 11

Performance vs. parallelism

(c) 2013 Jim Jeffers and James Reinders, used with permission.

Page 12

Performance vs. parallelism

(c) 2013 Jim Jeffers and James Reinders, used with permission.

Page 13

Performance vs. parallelism

(c) 2013 Jim Jeffers and James Reinders, used with permission.

Page 14

Performance vs. parallelism

(c) 2013 Jim Jeffers and James Reinders, used with permission.

Page 15

Terminology

● MIC = Many Integrated Cores (an Intel-developed architecture)
● GPU = Graphics Processing Unit (the Xeon Phi is not a GPU, but we will refer to GPUs)
● Possibly confusing terminology:

– Architecture: Many Integrated Cores (MIC)

– Product name (uses MIC architecture): Intel Xeon Phi

– Development codename: Knights Corner

– “The device” (in contrast to “the host”)

– “The target” of an offload statement

Page 16

MIC architecture under the hood

Page 17

MIC architecture under the hood

(c) 2013 Jim Jeffers and James Reinders, used with permission.

Page 18

Xeon Phis on Guillimin

● Nodes
– 50 Xeon Phi nodes, 2 devices per node = 100 Xeon Phis

– 2 x Intel Sandy Bridge EP E5-2670 (8-core, 2.6 GHz, 20MB Cache, 115W)

– 64 GB RAM

● Cards
– 2 x Intel Xeon Phi 5110P

– 60 cores, 1.053 GHz, 30 MB cache, 8 GB memory (GDDR5), Peak SP FP: 2.0 TFlops, Peak DP FP: 1.0 TFlops (= 1.053 GHz × 60 cores × 8 vector lanes × 2 flops/FMA)
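As a sanity check on that formula: 1.053 GHz × 60 cores × 8 DP vector lanes × 2 flops per FMA ≈ 1011 GFlops ≈ 1.0 TFlops. In single precision, the 512-bit registers hold 16 lanes instead of 8, which doubles this to ≈ 2.0 TFlops.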

Page 19

Comparisons

Notes:
• Chart denotes theoretical maximum values. Actual performance is application dependent.
• The K20 GPU has 13 streaming multiprocessors (SMXs) with 2496 CUDA cores, not directly comparable to x86 cores.
• The K20 GPU and Xeon Phi have GDDR5 memory; the Sandy Bridge has DDR3 memory.
• Accelerator workloads can be shared with the host CPUs.

Page 20

Benchmark Tests

Matrix multiplication results.
The SE10P is a Xeon Phi coprocessor with slightly higher specifications than the 5110P.
Source: Saule et al., 2013 - http://arxiv.org/abs/1302.1078

Page 21

Benchmark Tests

Embarrassingly parallel financial Monte-Carlo

Iterative financial Monte-Carlo with regression across all paths

Source: xcelerit blog, Sept. 4, 2013 (http://blog.xcelerit.com/intel-xeon-phi-vs-nvidia-tesla-gpu/)

“Tesla GPU” is a K20X, which has slightly higher specifications than the K20

Page 22

How can accelerators help you do science?

● Two ways of thinking about speedup from parallelism:
– 1: Compute a fixed-size problem faster
● Amdahl's law describes diminishing returns from adding more processors
– 2: Choose larger problems in the time you have
● Gustafson's law: problem size can often scale linearly with the number of processors
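For reference, with parallel fraction p of the work and N processors, the two laws can be written as: Amdahl: S(N) = 1 / ((1 - p) + p/N); Gustafson (scaled speedup): S(N) = (1 - p) + pN. For example, with p = 0.95, Amdahl caps the speedup at 20 no matter how many processors are added, while Gustafson's scaled speedup keeps growing with N.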

Page 23

Ways to use accelerators

● Accelerated application
● Libraries
● Directives and pragmas (OpenMP)
● Explicit parallel programming (OpenMP, MPI, OpenCL, TBB, etc.)

(Listed in order of increasing effort.)

Page 24

MIC Software Stack

Page 25

MIC Linux Libraries

The following Linux Standard Base (LSB) libraries are available on the Xeon Phi:

Library      Purpose
glibc        GNU C standard library
libc         The C standard library
libm         The math library
libdl        Dynamic linking
librt        POSIX real-time library (shared memory, time, etc.)
libcrypt     Passwords, encryption
libutil      Utility functions
libstdc++    GNU C++ standard library
libgcc_s     Low-level functions for gcc
libz         Lossless compression
libcurses    Displaying characters in terminal
libpam       Authentication

Page 26

Focus For Today

● Some - Accelerated Applications and Libraries
– Incredibly useful for research, relatively easy to use

– Will not teach you much about how Xeon Phis work

● Some - Explicit Programming
– The device supports your favourite parallel programming models

– We will keep the programming simple for the workshop

● Yes - Directives/Pragmas and compilation
– Will teach you about Xeon Phis

– We will focus mainly on OpenMP and MPI as parallel programming models

Page 27

Scheduling Xeon Phi Jobs

● Workshop jobs run on a single node + one or two Xeon Phi devices
– $ qsub -l nodes=1:ppn=16:mics=2

● $ qsub ./subJob.sh

● Example: subMiccheck.sh

Page 28

Exercise 3: Interacting with Phi

● Use your interactive session on the Phi nodes
● Log in to the Phi cards
– $ ssh mic0
● Try some of your favourite Linux commands
– $ cat /proc/cpuinfo | less
– $ cat /proc/meminfo | less
– $ cat /etc/issue
– $ uname -a
– $ env
● How many cores are available? How much memory is available? What operating system is running? What special environment variables are set?

Page 29

Filesystem on Phi

● The guillimin GPFS filesystem is mounted on the Xeon Phis using NFS

● You can access your home directory, project space(s), scratch, and the /software directory

● In general, reading and writing to the file system from the Phi is very slow
● Performance tip: minimize data transfers (and therefore file system use) from the Phi, and use /tmp for temporary files (a file system held in the card's 8 GB memory)

Page 30

Native mode and Offload mode

● There are two main ways to use the Phi:
– Compile a program to run directly on the device (native mode)
– Compile a program to run on the CPU, but offload hotspots to the device (offload mode, heterogeneous computing)
– Offload is more versatile
● Uses the resources of both the node and the device

Page 31

Automatic Offload mode

● Offloading linear algebra to the Phi using MKL
– Only need to set MKL_MIC_ENABLE=1

– Can be used by Python, R, Octave, Matlab, etc.

– Or from Intel C/C++/Fortran using -mkl switch

– Only effective for large matrices, at least 512x512 to 9216x9216, depending on function

– Uses both node and device

– Example: module python/2.7.3-MKL

Page 32

Exercise 4: Automatic offload

● See the script matmul.py, multiplying two random 8192x8192 matrices
● Run on host only
– $ module add python/2.7.3-MKL
– $ python matmul.py
– 8192,  4.160015,  264.2886
● Use automatic offload
– $ export MKL_MIC_ENABLE=1
– $ export OFFLOAD_REPORT=1
– $ python matmul.py
– 8192,  2.237704,  491.3271

● Now experiment with smaller and larger values of M, N, and K.

Page 33

Exercise 5: Native Compilation

● axpy_omp.c is a regular OpenMP vector a*x+y program
– The only Phi-specific code is within an #ifdef OFFLOAD preprocessor conditional
● Use the compiler option -mmic to compile for native MIC execution
– $ icc -o axpy_omp.MIC -mmic -openmp axpy_omp.c
● Attempt to run this program on the CPU.
– The Linux kernel automatically runs it on the Phi as “micrun ./axpy_omp.MIC”, where /usr/bin/micrun is a shell script.
● Attempt to run this program via ssh.
– $ ssh mic0 ./axpy_omp.MIC
– It fails. Why?

Page 34

Exercise 5: Native Compilation

● With plain ssh the library paths are not set up properly! Alternatively, use micnativeloadex to copy the libraries from the host and execute the program on the MIC.
– $ micnativeloadex ./axpy_omp.MIC
● Environment variables can be changed via the MIC_ prefix:
– MIC_OMP_NUM_THREADS=60 ./axpy_omp.MIC
● The device number is selected with the OFFLOAD_DEVICES variable (e.g. 1)
– $ OFFLOAD_DEVICES=1 ./axpy_omp.MIC

Page 35

Results

Host (16 Sandy Bridge cores):
$ icc -openmp -o axpy axpy_omp.c; ./axpy
OpenMP threads: 16
GFLOPS =    134.218, SECS =     11.103, GFLOPS per sec =     12.088

Offload to Xeon Phi (1.9x vs. host):
$ icc -DOFFLOAD -openmp -o axpy_offload axpy_omp.c; OMP_PLACES=threads ./axpy_offload
OpenMP threads: 16
OpenMP 4 Offload
GFLOPS =    134.218, SECS =      5.959, GFLOPS per sec =     22.523

Native on Xeon Phi (2.1x vs. host):
$ icc -openmp -o axpy.MIC -mmic axpy_omp.c; MIC_OMP_PLACES=threads ./axpy.MIC
OpenMP threads: 240
GFLOPS =    134.218, SECS =      5.317, GFLOPS per sec =     25.241

Page 36

Offload Mode

● Offload computational hotspots to a Xeon Phi device
● Requires instructions in the code:
– Intel's offload pragmas
● Older, more documentation available
● Vendor lock-in (code depends on hardware and compilers from a single supplier)
● Used in previous workshops; see the intel_offload folder for examples
– OpenMP 4.0
● Open standard
● More high-level than Intel's pragmas (the compiler knows more)
● Device agnostic (use with hosts, GPUs, or Phis)
● Currently, only the newest compilers support it (ifort_icc/14.0.4 and ifort_icc/15.0)
● Used in this workshop
– OpenCL (module add intel_opencl)
● Lower level
● Open standard, device agnostic
– Other standards will likely emerge
● e.g. the CAPS compiler by French company CAPS entreprise supports OpenACC for the Xeon Phi; also the future GCC 5.0

Page 37

Offload Mode (OpenMP4 - C/C++)

● The program runs on the CPU
● Programmer-specified hotspots are 'offloaded' to the device
– #pragma omp target device(1)
● Variables and functions can be declared on the device
– #pragma omp declare target
– static int *data;
– #pragma omp end declare target
● Data is usually copied to and from the device (data can be an array section)
– map(tofrom:data[5:3])
● Data can also be allocated on the device without copying
– map(alloc:data[:20])

Page 38

Offload Mode (OpenMP4 - Fortran)

● The program runs on the CPU
● Programmer-specified hotspots are 'offloaded' to the device
– !$omp target device(0)
– parallel-loop, parallel-section
– !$omp end target
● Variables, subroutines, and functions can be declared on the device
– !$omp declare target (data)
● Data is usually copied to and from the device
– map(tofrom:data)

Page 39

Offload Mode (Intel - C/C++)

● The program runs on the CPU
● Programmer-specified hotspots are 'offloaded' to the device
– #pragma offload target(mic:0)
● Variables can be declared on the device
– #pragma offload_attribute(push, target(mic))
– static int *data;
– #pragma offload_attribute(pop)
● Data is usually copied to and from the device
– in(varname : length(arraylength))
– out(varname : length(arraylength))
– inout(varname : length(arraylength))
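The same sketch as on the previous page, written with Intel's pragmas instead of OpenMP 4.0 (again with illustrative names, not the workshop code):

#include <stdio.h>

#define N 1000

/* compiled for both the host and the MIC */
#pragma offload_attribute(push, target(mic))
float square(float x) { return x * x; }
#pragma offload_attribute(pop)

int main(void) {
    static float a[N], b[N];
    int i;
    for (i = 0; i < N; i++) a[i] = (float)i;

    /* copy 'a' in to the device and 'b' back out */
    #pragma offload target(mic:0) in(a : length(N)) out(b : length(N))
    {
        for (i = 0; i < N; i++)
            b[i] = square(a[i]);
    }

    printf("b[10] = %f\n", b[10]);
    return 0;
}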

Page 40

Offload Mode (Intel - Fortran)

● The program runs on the CPU
● Programmer-specified hotspots are 'offloaded' to the device
– !DIR$ OFFLOAD BEGIN target(mic:0)
– ...
– !DIR$ END OFFLOAD
● Variables can be declared on the device
– !DIR$ OPTIONS /offload_attribute_target=mic
– integer, dimension(:) :: data
– !DIR$ END OPTIONS
● Data is usually copied to and from the device
– in(varname : length(arraylength))
– out(varname : length(arraylength))
– inout(varname : length(arraylength))

Page 41

What is the output?

#include <stdio.h>

int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target map(data)
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}

Please see memory.c

Page 42

What is the output?

● A) data: 5
● B) data: 7
● C) data: 2
● D) Error or segmentation fault
● E) None of the above

#include <stdio.h>

int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target map(data)
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}

Page 43

What is the output?

#include <stdio.h>

int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target map(data)
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}

● A) data: 5
● B) data: 7
● C) data: 2
● D) Error or segmentation fault
● E) None of the above

Explanation: the default map-type for map(data) is map(tofrom:data), so the value computed on the device is copied back and the program prints data: 7 (answer B).

Page 44

What is the output?

#include <stdio.h>

int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target map(from:data)
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}

● A) data: 5
● B) data: 7
● C) data: 2
● D) Error or segmentation fault
● E) None of the above

Page 45

What is the output?

#include <stdio.h>

int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target map(from:data)
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}

Explanation: with map(from:data), data points to uninitialized memory on the device when 2 is added, and that unpredictable value is copied back to the host (answer E).

● A) data: 5
● B) data: 7
● C) data: 2
● D) Error or segmentation fault
● E) None of the above

Page 46

What is the output?

● A) data: 5
● B) data: 7
● C) data: 2
● D) Error or segmentation fault
● E) None of the above

#include <stdio.h>

int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target map(to:data)
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}

Page 47

What is the output?

● A) data: 5
● B) data: 7
● C) data: 2
● D) Error or segmentation fault
● E) None of the above

#include <stdio.h>

int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target map(to:data)
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}

Explanation: with map(to:data), data is changed to 7 on the device, but the modified value is never copied back to the host, so the host prints data: 5 (answer A).

Page 48

What is the output?

● A) data: 5
● B) data: 7
● C) data: 2
● D) Error or segmentation fault
● E) None of the above

#include <stdio.h>

int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}

Page 49

What is the output?

● A) data: 5
● B) data: 7
● C) data: 2
● D) Error or segmentation fault
● E) None of the above

#include <stdio.h>

int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}

Explanation: a variable referenced in a target construct that is not declared in the construct is implicitly treated as if it had appeared in a map clause with a map-type of tofrom, so the program prints data: 7 (answer B).

Page 50

Important Points about Memory

● Device memory is different from host memory
– Device memory not accessible to host code

– Host memory not accessible to device code

● Data is copied to and from the device using pragmas (offload mode) or scp (native mode)

● Some programming models may use a virtual shared memory

Page 51

Exercise 6: Offload Programming

● The file offload.c is an OpenMP CPU program
● Compile and run:
– $ icc -o offload -openmp -O0 offload.c
● Modify this program to use the Phi card
– Use #pragma omp declare target so that some_work() is compiled for execution on the Phi
– Write an appropriate pragma to offload the some_work() call in main() to the Phi device, using the correct map clauses for transferring in_array and out_array.
● Compile and run the program
● Try: export OFFLOAD_REPORT=3
– Run your program again (no need to re-compile)

Page 52

Exercise 7: Environment Variables in Offload Programming

● Compile and run hello_offload.c and hello.c (a minimal sketch of hello_offload.c follows below)
– $ icc -openmp -o hello_offload hello_offload.c; ./hello_offload
– $ icc -openmp -o hello hello.c; ./hello
● How many OpenMP threads are used by each (default)?
● Change OMP_NUM_THREADS
– $ export OMP_NUM_THREADS=4
● Now how many OpenMP threads are used by each?
● Set a different value of OMP_NUM_THREADS for offload execution:
– $ export MIC_ENV_PREFIX=MIC
– $ export MIC_OMP_NUM_THREADS=5
● Now how many OpenMP threads are used by each?
● Note: offload execution copies all of your environment variables to the device, unless you have one or more variables beginning with $MIC_ENV_PREFIX_; in that case only those MIC-prefixed variables are copied
● Note: set variables for a specific coprocessor: $ export MIC_1_OMP_NUM_THREADS=5
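hello_offload.c itself is not reproduced in the handout; a minimal program in its spirit (an assumption, not the actual workshop file) just reports the thread count inside a target region:

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* the parallel region nested in 'target' runs on the device */
    #pragma omp target
    #pragma omp parallel
    {
        #pragma omp single
        printf("OpenMP threads in target region: %d\n", omp_get_num_threads());
    }
    return 0;
}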

Page 53

How should we compile/run this for offload execution?

● A) icc -mmic -openmp code.c; ./a.out

● B) icc -openmp code.c; ./a.out

● C) icc -mmic -openmp code.c; micnativeloadex a.out

● D) icc -openmp code.c; micnativeloadex a.out

● E) None of the above

#include <stdio.h>

int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}

Page 54

How should we compile/run this for offload execution?

● A) icc -mmic -openmp code.c; ./a.out

● B) icc -openmp code.c; ./a.out

● C) icc -mmic -openmp code.c; micnativeloadex a.out

● D) icc -openmp code.c; micnativeloadex a.out

● E) None of the above

#include <stdio.h>

int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}

Answer: B - compile without -mmic and run on the host; the target region is offloaded at run time.

Page 55

How should we compile/run this for native execution?

● A) icc -mmic -openmp code.c; ./a.out

● B) icc -openmp code.c; ./a.out

● C) icc -mmic -openmp code.c; micnativeloadex a.out

● D) icc -openmp code.c; micnativeloadex a.out

● E) None of the above

#include <stdio.h>

int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}

Page 56

How should we compile/run this for native execution?

● A) icc -mmic -openmp code.c; ./a.out

● B) icc -openmp code.c; ./a.out

● C) icc -mmic -openmp code.c; micnativeloadex a.out

● D) icc -openmp code.c; micnativeloadex a.out

● E) None of the above

#include <stdio.h>

int main(int argc, char* argv[]) {
    int data = 5;
    #pragma omp target
    {
        data += 2;
    }
    printf("data: %d\n", data);
    return 0;
}

For target pragmas the compiler gives a warning, but they are ignored in native-execution programs, so answer A works: compile with -mmic and run the binary; the whole program executes on the device.

Page 57

How should we set OMP_NUM_THREADS for native execution?

● A)
– $ export OMP_NUM_THREADS=240
● B)
– $ export MIC_ENV_PREFIX=MIC
– $ export MIC_OMP_NUM_THREADS=240
– ./a.out
● C)
– $ micnativeloadex ./a.out -e "OMP_NUM_THREADS=240"

Page 58

How should we set OMP_NUM_THREADS for native execution?

● A)
– $ export OMP_NUM_THREADS=240
● B)
– $ export MIC_ENV_PREFIX=MIC
– $ export MIC_OMP_NUM_THREADS=240
– ./a.out
● C)
– $ micnativeloadex ./a.out -e "OMP_NUM_THREADS=240"

Page 59

Memory Persistence

● Data transfers to/from the device are expensive and should be minimized

● By default, variables are allocated at the beginning of an offload segment and freed at the end

● Data can persist on the device between offload segments

● Must have a way to prevent freeing and re-allocation of memory if we wish to reuse it

Page 60

Memory Persistence

// Allocate the arrays only once on the target
#pragma omp target data map(to:in_data[:SIZE]) \
                        map(from:out_data[:SIZE])
{
  for (i = 0; i < N; i++) {
    // Do not copy data inside of the loop
    #pragma omp target
    {
        ...offload code...
    }
    // Copy out_data from target to host
    #pragma omp target update from(out_data[:SIZE])
    // do something with out_data
  }
}

Page 61

Memory Persistence

! Allocate the arrays only once on the target
!$OMP target data map(to:in_data(:SIZE)) map(from:out_data(:SIZE))
DO i=1,N
  ! Do not allocate or free on the target inside of the loop
  !$OMP target
    ...offload code...
  !$OMP end target
  ! Copy out_data from target to host
  !$OMP target update from(out_data(:SIZE))
  ! do something with out_data
END DO
!$OMP end target data

Page 62

Exercise 8: Memory Persistence

● Modify your solution to exercise 6 (offload.c) or copy the solution offload_soln.c from the solutions directory

● We would like to transfer, allocate, and free memory for in_array and out_array only once, instead of once per iteration

Page 63

Vectorization

● Two main requirements to achieve good performance
– Multithreading

– Vectorization

● Vectorization - Compiler interprets a sequence of steps (e.g. a loop) as a single vector operation

● Xeon Phi has 512 bit-wide (16 floats) SIMD registers for vectorized operations

for (i=0; i<4; i++) c[i] = a[i] + b[i];

Page 64

Vectorization

(c) 2013 Jim Jeffers and James Reinders, used with permission.

Page 65

Vectorization

● Use the -qopt-report[=n] -qopt-report-phase=vec compiler options (which replace -vec-report[=n]) to get a vectorization report
● If your code doesn't automatically vectorize, you must communicate to the compiler how to vectorize
– Use array notation (e.g. Intel Cilk Plus)
– Use #pragma omp simd (carefully)
● Avoid
– Data dependencies
– Strided (non-sequential) memory access
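A minimal sketch of the pragma route (a hypothetical function, not from the workshop files):

/* '#pragma omp simd' asserts to the compiler that the iterations are
   independent; if that assertion is wrong, the answers will be wrong too */
void vec_add(const float *a, const float *b, float *c, int n) {
    int i;
    #pragma omp simd
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}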

Page 66

Intel Cilk Plus

● C/C++ language extensions for multithreaded programming
● Available in Intel compilers (>= Composer XE 2010) and gcc (>= 4.9)
● Keywords
– cilk_for - parallel for loop
– cilk_spawn - execute a function asynchronously
– cilk_sync - synchronize cilk_spawn'd tasks
● Array notation
– array-expression[lower-bound : length : stride]
– C[0:5:2][:] = A[:];
● #pragma simd
– Simplest way to manually vectorize a code segment
● More information:
– https://www.cilkplus.org/
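As a tiny sketch of the keywords (a hypothetical function, not from the workshop files):

#include <cilk/cilk.h>

/* iterations may run in parallel across the Cilk worker threads */
void scale(float *x, int n, float a) {
    cilk_for (int i = 0; i < n; ++i)
        x[i] *= a;
}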

Page 67

Exercise 9: Vectorization

● Compile offload_novec.c with vectorization reporting
– $ icc -openmp -o offload_novec -qopt-report=3 -qopt-report-phase=vec offload_novec.c sini.c
● Note that the loop around the call to sini() does not vectorize. Why?
● We know that this algorithm should vectorize (try compiling offload_soln.c with -qopt-report=3 -qopt-report-phase=vec)
● Put a simd clause behind the for in the omp parallel for pragma before the loop and recompile. Alternative: use the -ipo switch.

Page 68

MPI on Xeon Phi

● Message Passing Interface (MPI) is a popular parallel programming standard
– especially useful for parallel programs using multiple nodes
● There are three main ways to use MPI with the Xeon Phi
– Native mode - directly on the device
– Symmetric mode - MPI processes run on both the CPU and the Xeon Phi
– Offload mode - MPI used for inter-node communication, code portions offloaded to the Xeon Phi (e.g. with OpenMP)
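The hello_mpi.c used in the next exercises is not reproduced in the handout; a minimal equivalent looks like this, where the processor name shows whether a rank ran on the host or on a mic:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello from rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}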

Page 69

Exercise 10: Native MPI on Xeon Phi

● Set up your environment for Intel MPI on the Xeon Phi
– $ module add intel_mpi
– $ export I_MPI_MIC=enable
● Compile hello_mpi.c for native execution
– $ mpiicc -mmic -o hello.MIC hello_mpi.c
● Run with mpirun (executed on the host)
– $ I_MPI_FABRICS=shm mpirun -n 60 -host mic0 ./hello.MIC
– Normally you would use fewer MPI processes per node, without the explicit shm setting

Page 70

Exercise 11: Symmetric MPI on Xeon Phi

● Use the same environment described in exercise 10
● Compile binaries for MIC and CPU
– $ mpiicc -mmic -o hello.MIC hello_mpi.c
– $ mpiicc -o hello hello_mpi.c
● Intel MPI must know the difference between MIC and CPU binaries
– $ export I_MPI_MIC_POSTFIX=.MIC
● Run with mpirun
– $ mpirun -perhost 1 -n 2 -host localhost,mic0 ./hello
– Or, without I_MPI_MIC_POSTFIX=.MIC:
– $ mpirun -host localhost -n 3 ./hello : -host mic0 -n 10 ./hello.MIC
– Use export I_MPI_FABRICS=shm:tcp in case of issues.

Page 71

Symmetric mode load-balancing

● Phi tasks will run slower than host tasks
● It is the programmer's responsibility to balance workloads between fast and slow processors

Page 72

Optimizing for Xeon Phi

● General tips
– Optimize for the host node's Xeon processors first
– Expose lots of parallelism
● SIMD
● Vectorization
– Minimize data transfers
– Try different numbers of threads, from 60 to 240
– Try different thread affinities
● e.g. OpenMP:
– $ export MIC_ENV_PREFIX=MIC
– $ MIC_OMP_PLACES=threads/cores
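One pass of such an experiment might look like this (the values shown are illustrative):

$ export MIC_ENV_PREFIX=MIC
$ export MIC_OMP_NUM_THREADS=120   # then try 60, 180, 240
$ export MIC_OMP_PLACES=cores      # then try threads
$ ./axpy_offload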

Page 73

KMP_AFFINITY / OMP_PLACES

OMP_PLACES=threads or KMP_AFFINITY=compact:
– Likely to leave cores unused

KMP_AFFINITY=scatter (default):
– Neighbouring threads on different cores - do not share cache

KMP_AFFINITY=balanced:
– Neighbouring threads on the same core - more efficient cache utilization

Page 74

Identifying Accelerator Algorithms

● SIMD parallelizability
– Number of concurrent threads (need dozens)
– Minimize conditionals and divergences
● Operations performed per datum transferred to the device (FLOPs/GB)
– Data transfer is overhead
– Keep data on the device and reuse it

Page 75

Identifying Accelerator Algorithms

● SIMD parallelizability
– Number of concurrent threads (need dozens)
– Minimize conditionals and divergences
● Operations performed per datum transferred to the device (FLOPs/GB)
– Data transfer is overhead
– Keep data on the device and reuse it

Page 76

Which algorithm gives the most Phi performance boost?

Put the following in order from least work per datum to most:
● i) matrix-vector multiplication
● ii) matrix-matrix multiplication
● iii) matrix trace (sum of diagonal elements)

● A) i, ii, iii
● B) iii, i, ii
● C) iii, ii, i
● D) i, iii, ii
● E) They are all about the same

Page 77

Which algorithm gives the most Phi performance boost?

Put the following in order from least work per datum to most:
● i) matrix-vector multiplication
● ii) matrix-matrix multiplication
● iii) matrix trace (sum of diagonal elements)

● A) i, ii, iii
● B) iii, i, ii
● C) iii, ii, i
● D) i, iii, ii
● E) They are all about the same

Page 78

Which algorithm gives the most Phi performance boost?

● Matrix trace (for an N×N matrix)
– assume you naively transfer the entire matrix to the device
– Work ~ N, Data ~ N²
● Matrix-vector multiplication
– Work ~ 2N², Data ~ N² + 2N
● Matrix-matrix multiplication
– Work ~ 2N³, Data ~ 3N²

So the order from least to most work per datum is iii, i, ii (answer B): matrix-matrix multiplication benefits most.

Page 79

Choosing a mode

1. Study the optimized CPU run time.
2. Study native mode scaling (30, 60, 120, 240 threads), collecting CPU and native mode profiling data.
3. Is native mode (at any scaling) faster than the CPU?
– Yes: consider native mode.
– No: are there functions which execute faster on the Phi?
● No: consider running on CPU only.
● Yes: is the work-per-datum benefit greater than the transfer cost?
– Yes: consider offloading those functions.
– No: consider running on CPU only.

Page 80

Review

● We learned how to:
– Gain access to the Xeon Phis through Guillimin's scheduler
– Log in and explore the Xeon Phi's operating system
– Compile and run parallel software for native execution on the Xeon Phis
– Compile and run parallel software for offload execution on the Xeon Phis
● Offload pragmas (target, map)
● Data persistence (target data)
– Ensure your code vectorizes for maximum performance
– Choose when to use the Xeon Phi and which mode to use

Page 81

Keep Learning...

● Xeon Phi documentation, training materials, example codes:
– http://software.intel.com/en-us/mic-developer
– https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-applications-and-solutions-catalog
● General parallel programming:
– http://openmp.org/wp/
– http://www.mpi-forum.org/
– http://cilkplus.org
– http://www.hpc.mcgill.ca/index.php/training
● Xeon Phi tutorials:
– http://software.intel.com/en-us/articles/intelr-xeon-phitm-advanced-workshop-labs
– http://www.drdobbs.com/parallel/programming-intels-xeon-phi-a-jumpstart/240144160?pgno=1
– https://www.cac.cornell.edu/vw/mic/
● Questions:
– http://software.intel.com/en-us/forums/intel-many-integrated-core
– http://stackoverflow.com

guillimin@calculquebec.ca

Page 82

What Questions Do You Have?

guillimin@calculquebec.ca

Page 83

Bonus Topics (Time permitting)

Page 84

More Xeon Phi practice

● Intel Math Kernel Library (MKL) examples
– $ cp $MKLROOT/examples/* .
– $ tar -xvf examples_mic.tgz
● Compile and run hybrid MPI+OpenMP for native and offload execution
– misc/hybrid_mpi_omp_mv4.c
● Compile and run OpenCL code for offload execution
– misc/vecadd_opencl.c