Intel® Software Conference 2015
Intel® Parallel Studio XE 2016 Composer Edition
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice
Introduction Overview, schedule, generic changes for Parallel Studio XE 2016
Intel® Parallel Studio XE 2016 – Components
Tools | Composer Edition | Professional Edition | Cluster Edition
Intel® C++ compiler | ☑ | ☑ | ☑
Intel® Fortran compiler | ☑ | ☑ | ☑
Intel® Math Kernel Library | ☑ | ☑ | ☑
Intel® Threading Building Blocks library | ☑ | ☑ | ☑
Intel® Integrated Performance Primitives | ☑ | ☑ | ☑
Intel® Cilk™ Plus parallel model | ☑ | ☑ | ☑
OpenMP* 4.0 | ☑ | ☑ | ☑
Intel® Advisor XE | - | ☑ | ☑
Intel® Inspector XE | - | ☑ | ☑
Intel® VTune™ Amplifier XE | - | ☑ | ☑
Intel® Data Analytic Acceleration Library (Intel® DAAL) | - | ☑ | ☑
Intel® MPI library | - | - | ☑
Intel® Trace Analyzer and Collector | - | - | ☑
Rogue Wave IMSL* Library | Bundled & Add-on | Add-on | Add-on
New !
Intel® Parallel Studio XE Composer Edition: component versions
Component | 2015 Update 2 | 2016 Beta (mid-April 2015) | 2016 Release (Q3/2015)
Intel® C++ Compiler (ICC) | 15.0 (L: 15.0.2.164, W: 15.0.2.179) | 16.0 (L: 16.0.0.036, W: 16.0.0.042) | 16.0 (L: 16.0.0.???, W: 16.0.0.???)
Intel® Math Kernel Library (Intel® MKL) | 11.2.2 | 11.3 | 11.3
Intel® Integrated Performance Primitives (Intel® IPP) | 8.2 | 9.0 | 9.0
Intel® Threading Building Blocks (Intel® TBB) | 4.3 | 4.4 | 4.4
Debug solution for Linux* | Intel-extended GDB based on 7.8 | Intel-extended GDB based on 7.8 | Intel-extended GDB based on 7.xxx
§ Beta test just started ( April, WW14/15 )
§ Please contact presenter to join beta testing !
Overall Support unchanged from Intel® Parallel Studio XE 2015
§ Linux*
  § Debian* 6 no longer supported
  § CentOS* 6.5 (64-bit only), Intel® Graphics Technology only
  § Red Hat* 5.x: still supported but deprecated!
§ Windows*
  § Latent support for Windows* 10 and Microsoft Visual Studio 2015*
  § Microsoft Visual Studio 2013* Shell replaces Visual Studio 2010* Shell
§ OS X*
  § Latest OS X* and Xcode* supported - currently OS X* 10.10 (Yosemite) and Xcode* 6.x
  § 32-bit Mac hardware *not* supported
  § 32-bit Mac application development is supported
Operating System and IDE Support
Intel® Parallel Studio XE 2016 Layout
<Installation Directory>
compiler_and_libraries_<version>.<update>.<pkg>
documentation_<version>
ide_support_<version>
samples_<version>
debugger_<version>
parallel_studio_xe_<version>.<update>.<pkg_psxe>
system_studio_<version>.<update>.<pkg_iss>
inde_<version>.<update>.<pkg_inde>
advisor_<version>
inspector_<version>
trace_analyzer_and_collector_<version>
vtune_amplifier_<version>
Shared components
Consistent version, update, pkg ids
Symbolic links
Intel® Parallel Studio XE 2016 Layout
<Installation Directory>
compiler_and_libraries_<version>.<update>.<pkg>
  <target OS>
    bin
      <host arch>[_<target arch>]
    compiler
      include
        <target_arch>
        cilk
      lib
        <target_arch>[_<target_os_subset>]
    ipp
    mkl
    tbb
    mpi
documentation_<version>
ide_support_<version>
samples_<version>
debugger_<version>
parallel_studio_xe_<version>.<update>.<pkg_psxe>
Shared components
Symbolic links
Compiler and Libraries for specific target OSes
Intel® Parallel Studio XE 2016 employs new feature-based names
Feature-based names require issuing a new license file
§ Customer must obtain new license file to use Intel® Parallel Studio XE 2016
§ New license file supports all releases
§ Contains both new feature-based (i.e. supports Intel® Parallel Studio XE 2016) and previous Product-based feature codes (i.e. supports Intel® Parallel Studio XE 2015 and earlier)
Existing license files provide ongoing support for releases prior to Intel® Parallel Studio XE 2016
Licensing Changes New License File
Changing Option Names on Linux* That Start with "-o"
Switch names starting with -o are renamed to start with -qo, except "-o <object files>"
§ Change made for compatibility with the GCC and LLVM switch naming convention
§ However, the key new reason is the popular ccache utility, which doesn't work with option names like "-openmp"
§ Really a design issue in ccache
Renaming Samples
Old Name New Name
-opt-report -qopt-report
-openmp -qopenmp
-opt-malloc -qopt-malloc
-offload -qoffload
§ Change process started with the 16.0 compiler release:
  § Names without the 'q' prefix are still accepted
  § Release version might print a deprecation message
New Features of all Compilers OpenMP, vectorization, optimization reports
Support for Features in OpenMP* 4.1 Technical Report 3
§ Non-structured data allocation
  § omp target [enter | exit] data
§ Asynchronous offload
  § nowait clause on omp target
§ Dependence (signal)
  § depend clause on omp target
§ Map clause extensions
  § Modifiers always and delete
Available for C/C++ and Fortran
Note: Standard not released yet – very likely in Q4/2015
OpenMP* 4.1 Extensions
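The extensions listed above can be combined roughly as in the following sketch. It is an illustration only, not code from the presentation: the function name scale_async is hypothetical, and the directive spellings follow the form in which these features were later published in OpenMP 4.5. A compiler without offload support ignores the pragmas or runs the region on the host, and the result is the same.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not from the slides): scales v by `factor` inside an
// asynchronous target region and returns the scaled copy.
std::vector<float> scale_async(std::vector<float> v, float factor) {
    assert(!v.empty());
    float* p = v.data();
    const int n = static_cast<int>(v.size());

    // Non-structured data allocation: map p[0:n] to the device and copy it in.
    #pragma omp target enter data map(to: p[0:n])

    // Asynchronous offload: nowait lets the encountering thread continue,
    // depend(out) marks p[0:n] as produced by the deferred target task.
    #pragma omp target nowait depend(out: p[0:n])
    for (int i = 0; i < n; ++i) {
        p[i] *= factor;
    }

    // Wait for the deferred target task before reading the results.
    #pragma omp taskwait

    // Copy p[0:n] back to the host and release the device copy.
    #pragma omp target exit data map(from: p[0:n])
    return v;
}
```

Calling scale_async({1, 2, 3}, 2) yields the values 2, 4 and 6 whether or not a device is present.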
simdlen (i.e. vectorlength) and safelen for loops
§ Usable with #pragma simd (Intel® Cilk™ Plus) and omp simd (OpenMP*)
Array reductions
§ Fortran only (available in Beta update)
User-defined reductions
§ Supported for parallel in C/C++ for POD types. No support for Fortran, SIMD, or non-POD types (C++)
omp simd collapse(N) clause
§ Available in a Beta update
FP-model honoring for simd loops
Improvements in Vectorization Intel® Cilk™ Plus and OpenMP* 4.0
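As a small illustration of simdlen and safelen on a loop, consider the sketch below. The function add_shifted and the chosen values are assumptions made for this example. Each element depends on the element 8 positions earlier, so a chunk of up to 8 consecutive iterations can run concurrently without violating the dependence.

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: a[i] depends on a[i - 8], a loop-carried dependence
// with distance 8.
void add_shifted(std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size());
    float* pa = a.data();
    const float* pb = b.data();

    // safelen(8): iterations that run concurrently are never further apart
    //             than the dependence distance allows.
    // simdlen(4): preferred number of iterations per SIMD chunk; it must not
    //             exceed safelen.
    #pragma omp simd safelen(8) simdlen(4)
    for (int i = 8; i < n; ++i) {
        pa[i] = pa[i - 8] + pb[i];
    }
}
```

Without OpenMP SIMD support the pragma is ignored and the loop runs serially with the same result.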
Improvements in Vectorization Other internal improvements
Alignment analysis
§ Information propagation improved
§ __assume_aligned() fixed
Memory reference analysis
§ Resolved all "subscript/dereference too complex" cases
§ More convoluted cases optimized to use vector loads
Improvements for AVX512
§ conflict/compress/expand idioms improved
Improved optimization reports
Uniformity analysis and handling
§ Scalar control flow and scalar computations
§ Benefits to memory reference analysis
Local target control supported
§ Vectorization properly targeted, e.g.

  #include <immintrin.h>
  void foo1(float *y, float *a, float *b, int n) {
    if (_may_i_use_cpu_feature(_FEATURE_AVX2)) {
      for (int i=0; i < n; ++i)
        y[i] = a[i]*y[i] + b[i]; // use FMA
    } else {
      for (int i=0; i < n; ++i)
        y[i] = a[i]*y[i] + b[i];
    }
  }
Loop Blocking Pragma/Directive
§ Syntax:
  C++:
    #pragma block_loop [clause[,clause]...]
    #pragma noblock_loop
  Fortran:
    !DIR$ BLOCK_LOOP [clause[[,] clause]...]
    !DIR$ NOBLOCK_LOOP
  clause:
    factor ( expr )
    level ( levels )
    private ( var1 [,var2 ]... )
§ BLOCK_LOOP enables greater control over optimizations on a specific DO/for loop inside a nested loop
§ Uses the loop blocking technique to separate loops with large iteration counts into smaller iteration groups
§ Smaller groups can increase efficiency of cache space use and augment performance
§ Works seamlessly with other directives including SIMD
Loop Blocking Sample

Original source code:

  #pragma block_loop factor(250) level(2)
  for (i=0; i < m; i++) {
    for (j=0; j < m; j++) {
      c[i] += a[i][j]*b[j];
    }
  }

Outline of code after compiler loop transformations:

  for (jj=0; jj < m/250+1; jj++) {
    for (i=0; i < m; i++) {
      for (j=jj*250; j < min((jj+1)*250,m); j++) {
        c[i] += a[i][j]*b[j];
      }
    }
  }

Note: It is not always safe to interchange the iteration variables, because dependencies between statements can constrain the order in which they execute. The compiler performs this safety check!
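The effect of the transformation can be reproduced by hand in portable C++. The following is a minimal sketch under assumptions: the function names and the Matrix alias are invented for this example, and the blocking factor is a parameter rather than the fixed 250 of the sample. Both versions accumulate each c[i] in the same j order, so they return identical results.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Reference version: c[i] += a[i][j] * b[j] over the full j range.
std::vector<double> matvec_plain(const Matrix& a, const std::vector<double>& b) {
    const int rows = static_cast<int>(a.size());
    const int m = static_cast<int>(b.size());
    std::vector<double> c(rows, 0.0);
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < m; ++j)
            c[i] += a[i][j] * b[j];
    return c;
}

// Hand-blocked version: the j loop is split into groups of `factor`
// iterations, and the loop over the groups is moved outermost so that one
// slice of b stays in cache while all rows are processed.
std::vector<double> matvec_blocked(const Matrix& a, const std::vector<double>& b,
                                   int factor) {
    assert(factor > 0);
    const int rows = static_cast<int>(a.size());
    const int m = static_cast<int>(b.size());
    std::vector<double> c(rows, 0.0);
    for (int jj = 0; jj < m; jj += factor) {
        const int j_end = std::min(jj + factor, m);
        for (int i = 0; i < rows; ++i)
            for (int j = jj; j < j_end; ++j)
                c[i] += a[i][j] * b[j];
    }
    return c;
}
```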
Syntax:
✓ C++
    #pragma omp ordered [simd] newline
      structured code block
  or
    #pragma simdoff
      structured code block
✓ Fortran
    !$omp ordered [simd]
      structured code block
    !$omp end ordered
Semantics:
✓ The ordered construct with the simd clause specifies a structured block in the simd loop or SIMD function that will be executed in the order of the loop iterations or of the sequence of calls to SIMD functions.
Rules:
✓ #pragma simdoff / #pragma omp ordered simd is only allowed inside a SIMD loop or SIMD-enabled function.
✓ A simdoff region must be a single-entry and single-exit code block
✓ The strict ordered execution is only guaranteed for the block itself
✓ Execution remains weakly ordered w.r.t. code outside of the block or other ordered blocks
✓ Data dependencies between statements of the same block will be correctly resolved
✓ Other non-vector dependencies originating in an ordered block still lead to undefined behavior
Ordered Blocks in SIMD Contexts
Ordered Examples
OK:

  for _Simd (i = 0; i < N; i++) {
    ...
    #pragma simdoff
    { a[indices[i]] += b[i]; // index conflict
    }
    ...
    #pragma simdoff
    { if (c[i] > 0) q[j++] = b[i]; // compress
    }
    ...
    #pragma simdoff
    { lock(L) // atomic update
      if (x > 10) x = 0;
      unlock(L)
    }
    ...
    #pragma simdoff
    { a[indices[i]] += b[i]; // still OK
    }
  }

Not OK:

  for _Simd (i = 0; i < N; i++) {
    ...
    #pragma simdoff
    { if (c[i] > 0) q[j++] = b[i]; // compress
    }
    ...
    #pragma simdoff
    { if (c[i] > 0)     // Order will change
        q[j++] = d[i];  // compared to serial
    }
  }
Fighting Data Dependencies inside Loop
Using #pragma simdoff and array reductions

  #pragma simd
  for(int i=0; i < VL; i++) {
    …
    val = values[i];
    grp = groups[i];
    #pragma simdoff // index conflict
    { g_total[grp] += val; }
    …
  }

Solution: array reductions

  #pragma simd reduction(+:g_total)
  for(int i=0; i < VL; i++) {
    …
    val = values[i];
    grp = groups[i];
    g_total[grp] += val;
    …
  }

[Figure: grp (indices) = 0 3 2 3 0 2 1 2; v (values) = 5 7 8 9 3 6 5 3. Each SIMD lane accumulates into its own private copy of g_total; the private copies are then reduced to g_total = 8 5 17 16.]
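What an array reduction does can be mimicked in plain C++: every lane adds into its own private copy of the totals, and the copies are summed at the end. The sketch below is an illustration under assumptions (a width of 4 lanes, 4 groups, and the hypothetical name grouped_sum); it is not the code the compiler generates.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kLanes = 4;   // assumed SIMD width for the illustration
constexpr int kGroups = 4;  // number of distinct group indices

// Sums values[i] into total[groups[i]] using one private copy per lane, so
// two lanes never update the same memory location.
std::array<int, kGroups> grouped_sum(const std::vector<int>& groups,
                                     const std::vector<int>& values) {
    assert(groups.size() == values.size());
    std::array<std::array<int, kGroups>, kLanes> priv{};  // zero-initialized

    for (std::size_t i = 0; i < values.size(); ++i) {
        assert(groups[i] >= 0 && groups[i] < kGroups);
        const std::size_t lane = i % kLanes;  // lane that would handle iteration i
        priv[lane][static_cast<std::size_t>(groups[i])] += values[i];
    }

    // Final reduce step: combine the private copies.
    std::array<int, kGroups> total{};
    for (const auto& copy : priv)
        for (std::size_t g = 0; g < copy.size(); ++g)
            total[g] += copy[g];
    return total;
}
```

With the indices 0 3 2 3 0 2 1 2 and the values 5 7 8 9 3 6 5 3 used on this slide, the function returns 8 5 17 16.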
Adjacent Gathers
§ Three basic forms
  § Support for non-unit-strided accesses
  § Support for indirect accesses
  § Support for stencil codes
§ Replace a series of gathers with a series of vector loads and a sequence of permutes
§ In the stencil case, reduce the number of gathers/loads
§ Support is case-driven: important cases get priority
  § Not a generic solution
  § Simple cases supported in 15.0
  § Much more in 16.0
§ Submit your important cases!

  for (int i=start_idx; i<end_idx; i++) {
    TYPE acc_x_0 = 0, acc_y_0 = 0, acc_z_0 = 0;
    for (int j=b_start_idx; j<b_end_idx; j+=1) {
      TYPE dt_x_0 = In_X[(j+0)] - In_X[i];
      TYPE dt_y_0 = In_Y[(j+0)] - In_Y[i];
      TYPE dt_z_0 = In_Z[(j+0)] - In_Z[i];
      acc_x_0 += s_0*dt_x_0;
      acc_y_0 += s_0*dt_y_0;
      acc_z_0 += s_0*dt_z_0;
    }
    Out_V[3*(i+0)+0] += delta_t * acc_x_0;
    Out_V[3*(i+0)+1] += delta_t * acc_y_0;
    Out_V[3*(i+0)+2] += delta_t * acc_z_0;
  }

  for (int k = 0; k < numneighs; k++) {
    const int j = neighs[k];
    double xj = x[j * PAD + 0];
    double yj = x[j * PAD + 1];
    double zj = x[j * PAD + 2];
    …
  }

  do 2900 i=ibeg+iriter,iend,4
    do 2070 j=jbeg-1,jend+1
      . . .
      dqm = (v1(i,j,k) - v1(i,j-1,k)) * dx2bi(j)
      dqp = (v1(i,j+1,k) - v1(i,j,k)) * dx2bi(j+1)
      dq(j,1) = max (dqm * dqp, zro) * sign (one, dqm + dqp) / max(abs(dqm + dqp), tiny)
      . . .

Don’t rely too much on the compiler: use SoA layout if possible
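A one-time conversion from the interleaved layout to a structure of arrays (SoA) removes the need for gathers altogether. The sketch below assumes the padded x/y/z layout of the indirect-access example; the names PositionsSoA, to_soa and sum_dx are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Structure of arrays: each coordinate is stored contiguously.
struct PositionsSoA {
    std::vector<double> x, y, z;
};

// Converts an interleaved array (x0 y0 z0 [pad] x1 y1 z1 [pad] ...) with
// `pad` doubles per element into separate x, y and z arrays.
PositionsSoA to_soa(const std::vector<double>& aos, std::size_t pad) {
    assert(pad >= 3 && aos.size() % pad == 0);
    const std::size_t n = aos.size() / pad;
    PositionsSoA s;
    s.x.resize(n);
    s.y.resize(n);
    s.z.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        s.x[i] = aos[i * pad + 0];
        s.y[i] = aos[i * pad + 1];
        s.z[i] = aos[i * pad + 2];
    }
    return s;
}

// With SoA the inner loop reads s.x with unit stride, so a vectorizing
// compiler can use plain vector loads instead of gathers.
double sum_dx(const PositionsSoA& s, double xi) {
    double acc = 0.0;
    for (std::size_t j = 0; j < s.x.size(); ++j)
        acc += xi - s.x[j];
    return acc;
}
```

The conversion costs one pass over the data, so it tends to pay off when the arrays are traversed many times afterwards.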
How it works
  for (int i = 0; i < size; ++i) {
    for (int j = i + 1; j < size; ++j) {
      xij = xi - data[3 * j];
      yij = yi - data[3 * j + 1];
      zij = zi - data[3 * j + 2];
      …

[Figure: instead of 3 gathers of data[3*j], data[3*j+1] and data[3*j+2] into registers, the compiler issues vector loads from data followed by permutes, blends and shuffles.]

From 30% to 48% speed-up on KNC for size equal to 10000

KNC sequence is:
• 3 pairs of loadunpacklpd/loadunpackhpd
• 6 cross-lane permutations
• 5 blends
• 2 in-lane shuffles or swizzles
Intel® Advisor XE - Vectorization Advisor Data Driven Vectorization Design
Have you:
§ Recompiled with AVX2, but seen little benefit?
§ Wondered where to start adding vectorization?
§ Recoded intrinsics for each new architecture?
§ Struggled with cryptic compiler vectorization messages?
Breakthrough for vectorization design
§ What vectorization will pay off the most?
§ What is blocking vectorization and why?
§ Are my loops vector friendly?
§ Will reorganizing data increase performance?
§ Is it safe to just use pragma simd?
More Performance Fewer Machine Dependencies
A modified copy of a source code file with each line numbered and compiler diagnostics inserted after the corresponding lines
The listing file can be generated in either a plain text or html format
Example:
Annotated Source Listing (ASL) (available in Beta Update 1)
  1 int* foo(int* a, int* b, int upperbound){
  2
  3 int* c = new int[upperbound];
  4 #pragma omp parallel for
  OpenMP DEFINED LOOP WAS PARALLELIZED
  5 for (int i = 0; i < upperbound; ++i) {
  LOOP BEGIN at Test/library.cpp(5,2)
  <Peeled>
  LOOP END
  LOOP BEGIN at Test/library.cpp(5,2)
     remark #25460: No loop optimizations reported
  LOOP END
  6 c[i] = a[i] + b[i];
  7 }
  8 return c;
  9 }
[/Q | -q]opt-report-annotate=[text | html]
§ Enable annotated source listing using specified format (Default: Disabled)
§ When enabled without format specification, format defaults to: text
[/Q | -q]opt-report-annotate-position=[caller | callee | both]
§ Enable annotated source listing and specify site where optimization messages appear for inlined cases of loop optimizations (Default: Disabled)
§ When enabled without position specification, site defaults to: caller
Annotated Source Listing (ASL) Compiler Options
What’s New in Intel® Fortran Compiler XE 2016 Intel® Fortran Compiler 16.0
Submodules from Fortran 2008
IMPURE ELEMENTAL from Fortran 2008
Further C Interoperability from Fortran 2015
Other New Features
§ ASYNCHRONOUS communication
§ -fpp-name option
§ VS2013 Shell
§ Uninitialized Variable Run-time Detection
New and Changed Features
Submodules (F2008) – The Problem
  module bigmod
    …
  contains
    subroutine sub1
      …<implementation of sub1>
    function func2
      …<implementation of func2>
    subroutine sub47
      …<implementation of sub47>
    …
  end module bigmod

  ! Source source1.f90
  use bigmod
  …
  call sub1

  ! Source source2.f90
  use bigmod
  …
  x = func2(…)

  ! Source source47.f90
  use bigmod
  …
  call sub47

[Figure: some edit inside module bigmod forces a recompile of bigmod and of every source file that uses it.]
Changes in the submodule do not force recompilation of uses of the module – as long as the interface does not change
Submodules (F2008) – The Solution
  module bigmod
    …
    interface
      module subroutine sub1
      …
      module function func2
      …
      module subroutine sub47
      …
    end interface
  end module bigmod

  submodule (bigmod) bigmod_submod
  contains
    module subroutine sub1
      … <implementation of sub1>
    module function func2
      … <implementation of func2>
    module subroutine sub47
      … <implementation of sub47>
  end submodule bigmod_submod
TS29113 on “Further Interoperability of Fortran with C” to be part of Fortran 2015. Motivations include:
§ Support needed for MPI-3 Fortran2008 language binding
§ See chapter 17.1 of MPI 3.0 Standard
§ Provide Fortran equivalent of C’s “void*” – assumed type and rank
§ Enable C code to manipulate array descriptors
§ Extend interoperable interfaces to ALLOCATABLE, POINTER, OPTIONAL, assumed shape, character assumed length
§ Extend ASYNCHRONOUS attribute beyond I/O
§ Relaxed restrictions
Further C Interoperability (F2015)
C-Descriptor Structure - CFI_cdesc_t
§ The new functionality requires the C-Code to have access to all the components describing the Fortran objects (arrays in general)
§ A structure containing (simplified) :
Type & Name | Value
void * base_addr | Base address of object
size_t elem_len | Storage size of a single element
int version | CFI_VERSION number
CFI_rank_t rank | Number of dimensions
CFI_type_t type | Number identifying the intrinsic, interoperable interface
CFI_attribute_t attribute | Identifies whether object is allocatable, a pointer etc.
CFI_dim_t dim[ ] | For each dimension (rank): lower bound, extent and stride
Sample : MPI-3 Fortran MPI_F08 Language Binding
  SUBROUTINE MPI_Send( buf, count, datatype, dest, &
                       tag, comm, ierror )
    TYPE(*), DIMENSION(..), INTENT(IN) :: buf
    INTEGER, INTENT(IN) :: count, dest, tag
    TYPE(MPI_Datatype), INTENT(IN) :: datatype
    TYPE(MPI_Comm), INTENT(IN) :: comm
    INTEGER, OPTIONAL, INTENT(OUT) :: ierror
  END SUBROUTINE MPI_Send
  …
  CALL MPI_Send(y(1::2,:), size(y(1::2,:),KIND=c_int), MPI_INT, dest, tag, MPI_COMM_WORLD )
Fortran interface for MPI_Send routine (as defined in the MPI_F08 module from MPI‑3.0)
ASYNCHRONOUS Attribute
§ Guarantee of correct asynchronous operations
§ Fortran has no pointer aliasing
§ Compilers tend to aggressively re-order code
  § Compiler can move the code xx=buf(…) above the MPI_Wait()

  REAL :: buf(100,100)
  TYPE(MPI_Request) :: req
  TYPE(MPI_Status) :: status
  ... ! Code that involves buf
  BLOCK
    ASYNCHRONOUS :: buf
    CALL MPI_Irecv( buf, size(buf), MPI_REAL, src, tag, &
                    MPI_COMM_WORLD, req )
    ... ! Overlapped computation that does not involve buf
    CALL MPI_Wait( req, status )
    xx = buf(2,3) ! Without ASYNCHRONOUS, compiler could
                  ! move code before MPI_Wait call
  END BLOCK
Uninitialized variable checking using [Q]init option is extended to local, automatic, and allocated variables of intrinsic numeric type
Example:
Uninitialized Variable Run-time Detection
   4 real, allocatable, dimension(:) :: A
   5
   6 ALLOCATE(A(N))
   7
   8 do i = 1, N
   9   Total = Total + A(I)
  10 enddo

  $ ifort -init=arrays,snan -g -traceback sample.F90 -o sample.exe
  $ sample.exe
  forrtl: error (182): floating invalid - possible uninitialized real/complex variable.
  Image        PC                Routine  Line  Source
  ...
  sample.exe   0000000000402E12  MAIN__   9     sample.F90
  ...
  Aborted (core dumped)
OpenMP 4.1
• TARGET NOWAIT – current task may continue execution without waiting for the target to finish
• TARGET DEPEND – treated as if DEPEND had been specified for implicit TASK construct enclosing TARGET
[NO]BLOCK_LOOP enables or disables loop blocking for the following loop
New -fpp-name option
• Lets you supply your own fpp preprocessor
VS2013 Shell replaces VS2010 Shell on Windows
Other New Features
C/C++ Specific New Features Intel® C/C++ Compiler 16.0
Compile Time Improvements
• Intrinsic headers emmintrin.h, immintrin.h etc. provided by Intel
• Very large files - thousands of prototypes for SSE, AVX, … intrinsics like
    extern __m128 _mm_shuffle_ps(__m128, __m128, unsigned int);
    extern __m128 _mm_unpackhi_ps(__m128, __m128);
• Opened and parsed before each compilation – takes much time !
• Prototypes are now automatically disabled in headers
• Use -D__INTEL_COMPILER_USE_INTRINSIC_PROTOTYPES to restore old behaviour (enhanced type checking)
C11 Standard Support
New Feature Support
§ Unicode strings
§ C11 anonymous unions
New Keyword Support
  _Alignas  _Alignof  _Static_assert  _Thread_local  _Noreturn  _Generic
_Generic Example

  #define pow(X) _Generic((X), long double: powl, \
                               default: pow, \
                               float: powf)(X)
§ Generic Lambdas
§ Generalized lambda captures
§ Digit Separators
§ [[deprecated]] attribute
§ Function return type deduction
§ Member initializers and aggregates
§ Feature test macros
C++14 Standard Support
Reference: C++14 FDIS http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3690.pdf
New in C++14

Generic lambdas

  auto glambda = [] (auto a) { return a; };

Generalized lambda captures

  int x = 4;
  int z = [&r = x, y = x+1] {
    r += 2;      // set x to 6; "R is for Renamed Ref"
    return y+2;  // return 7 to initialize z
  }();           // invoke lambda

Function return type deduction

  auto foo(int i) {
    if (i == 1) return i;
    else return foo(i-1)+i;
  }
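Several of the listed C++14 features fit into one short translation unit. The sketch below is illustrative only; all names in it (kMillion, old_sum, triangular, twice_demo) are made up for the example, and it needs a compiler running in C++14 mode.

```cpp
#include <cassert>
#include <string>

// Digit separators make long literals easier to read.
constexpr long kMillion = 1'000'000;

// [[deprecated]] makes the compiler warn at every call site of old_sum.
[[deprecated("use triangular() instead")]]
inline int old_sum(int n) { return n * (n + 1) / 2; }

// Function return type deduction: the first return statement fixes the
// type (int), which then allows the recursive call below it.
auto triangular(int n) {
    if (n <= 1) return n;
    return triangular(n - 1) + n;
}

// Generic lambda: one lambda that works for any type supporting operator+.
inline std::string twice_demo() {
    auto twice = [](auto v) { return v + v; };
    return twice(std::string("ab")) + std::to_string(twice(21));
}
```

Here triangular(4) evaluates to 10 and twice_demo() returns "abab42".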
See Technical Report: https://isocpp.org/std/standing-documents/sd-6-sg10-feature-test-recommendations
C++14: Feature Test Macros

Test for existence of a header:

  #if __has_include("shared_mutex")
  // use standard header here
  #elif __has_include("boost/shared_mutex.h")
  // use BOOST header
  #endif

Test for existence of compiler feature:

  #ifndef __cpp_constexpr
  // no constexpr functionality available
  #elif __cpp_constexpr == 200704
  // c++11 constexpr functionality available
  #else
  // c++14 constexpr functionality available
  #endif
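In compilable form, the two kinds of tests might look like the sketch below. The macro HAVE_STD_SHARED_MUTEX and the function constexpr_level are names invented for this example; the check for __has_include itself is nested so that compilers lacking it never see the function-like use.

```cpp
#include <cassert>
#include <string>

// Header test: only evaluate __has_include where the compiler provides it.
#if defined(__has_include)
#  if __has_include(<shared_mutex>)
#    define HAVE_STD_SHARED_MUTEX 1
#  endif
#endif
#ifndef HAVE_STD_SHARED_MUTEX
#  define HAVE_STD_SHARED_MUTEX 0
#endif

// Compiler feature test: report which level of constexpr support the
// current compilation mode advertises.
inline std::string constexpr_level() {
#ifndef __cpp_constexpr
    return "none";
#elif __cpp_constexpr < 201304
    return "c++11";
#else
    return "c++14";
#endif
}
```

The returned string depends on the compiler and the selected language standard, which is the point of the macros: the same source adapts to what is available.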
GNU* compatibility
§ Enable C11 or C++14 support via options: -std=c11 -std=c++14
§ Supports the same C++14 and C11 features as GNU* 5.x (not released yet), except _Atomic (used in <stdatomic.h>)
§ Standards support matches installed GNU* version (i.e. g++ in your PATH)
Microsoft* compatibility
§ Enable C11 or C++14 features (beyond Microsoft* reference compiler) via options: /Qstd=c11 /Qstd=c++11 /Qstd=c++14
§ Supports the same C++14 and C11 features as Microsoft* Visual C++ 2015 (not released yet)
§ Compatible by default; features match system’s reference Microsoft* compiler
GNU* and Microsoft* Compatibility
Other New Features and Enhancements C/C++ [1]
§ SIMD (two operand) operators support with SSE types
    + - * / & | ^ += -= *= /= &= |= ^= == != > < >= <=
  § 128 and 256-bit SIMD types only
  § operands must be of same type
      __m128i x,y,z;
      x = y + z;
§ Compiler option control of honoring parentheses
  § -f[no-]protect-parens /Qprotect-parens[-]
    Enable/disable (DEFAULT) optimizer honoring of parentheses around floating-point expressions (including complex and decimal)
§ Decimal floating point extension support now for Windows C/C++ too
§ See document “ISO/IEC JTC 1/SC 22/WG 14 N1912” at http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1912.pdf
§ C++ wrappers for Intel® AVX512 vector operations
  § Operations (+, -, *, /, ^, |, &, …), math (erf, sqrt, pow, …)
§ Integer classes (common AVX512 ISA):
  § M512vec
  § I[s,u]64vec8 (8 signed or unsigned 64-bit integers)
  § I[s,u]32vec16 (16 signed or unsigned 32-bit integers)
§ FP classes
  § F64vec8 (8 doubles)
  § F32vec16 (16 floats)
§ Integer classes (AVX512BW)
  § I[s,u]16vec32 (32 words)
  § I[s,u]8vec64 (64 bytes)
Other New Features and Enhancements C/C++ [2]
Intel® Cilk™ Plus Combined Parallel/SIMD loops
§ New combined Parallel + Vector loop

    _Cilk_for _Simd (int i = 0; i < N; ++i)
      // Do something

  or

    #pragma simd
    _Cilk_for (int i = 0; i < N; ++i)
      // Do something

§ Combined loop gives both parallelism (using threads) and vectorization
§ Behaves approximately like this pair of nested loops

    _Cilk_for (int i_1 = 0; i_1 < N; i_1 += M)
      for _Simd (int i = i_1; i < i_1 + M; ++i) // same as #pragma simd
        // Do something

§ The chunk size, M, is determined by the compiler and runtime
Performance Libraries Intel® Math Kernel Library,
Intel® Integrated Performance Primitives,
Intel® Threading Building Blocks
Additional Sparse Matrix Vector Multiplication API
§ new two stage API for Sparse BLAS level 2 and 3 routines
MKL-MPI wrappers
§ all MPI implementations are API-compatible but MPI implementations are not ABI-compatible
§ MKL-MPI wrapper solves this problem by providing an MPI-independent ABI to MKL
Optimized HPCG (High Performance Conjugate Gradients) benchmark
§ designed to be more representative of common application workloads
Intel® Math Kernel Library 11.3 Beta
Support For Small Matrix multiplication
§ a single call executes multiple independent ?GEMM operations simultaneously
Support for Philox4x32-10 and ARS5 RNG
§ two new pseudorandom number generators with a period of 2^128, highly optimized for multithreaded environments
Sparse Solver SMP improvements
§ significantly improved overall scalability for Intel Xeon Phi coprocessors and scalability of the solving step for Intel Xeon processors
Intel® Math Kernel Library 11.3 Beta
Intel® Integrated Performance Primitives 9.0 Beta New Features
Additional optimization for Intel® processors with Intel® AVX2 instructions support
§ Intel® AVX2: computer vision, image processing optimization
New APIs to support external threading
New APIs to support external memory allocation
Improved CPU dispatcher
§ Auto-initialization. No need for CPU initialization call in static libraries
§ Code dispatching based on CPU features
Optimized cryptography functions to support SM2/SM3/SM4 algorithm
Custom dynamic library building tool
IPP Focus Areas and Supporting Domains
Cryptography: 600 primitives
Data Compression: 150 primitives
String Processing: 100 primitives
Computer Vision: 700 primitives
Color Correction: 500 primitives
Image Processing: 3500 primitives
Signal Processing: 2100 primitives
Vector Math: 400 primitives
IPP Core: 25 primitives
Focus areas: Image Processing & Computer Vision, Data Compression, String Processing, Cryptography
New Intel® IPP Package Structure
OpenCV at a Glance: > 500 functions
Legend: OCV only / Covered by IPP
General Image processing functions Image Pyramids
Image Descriptors
Camera calibration, Stereo, 3D
Segmentation
Transforms
Features
Tracking
Utilities and Data Structures
Fitting Machine Learning: • Detection, • Recognition
Matrix Math Intel® IPP covers ~60%
§ Intel® IPP for OpenCV (ICV) – subset of Intel® IPP. It contains about 750 functions that are integrated into OpenCV 3.0.
§ OpenCV 3.0 turns on ICV usage by default for x86 configurations.
§ ICV 8.2 gives a 1.7x speed-up (geometric mean) on Haswell and 1.6x on Bay Trail vs. original “plain” OpenCV
Performance vs. Promise OpenCV 3.0 Performance Increases with IPP Optimizations
Intel® Threading Building Blocks – What’s New
Fully supported tbb::task_arena
§ Task arenas provide improved control over workload isolation and the degree of concurrency.
Dynamic replacement of standard memory allocation routines for OS X*.
§ Utilize the powerful TBB scalable allocator easily on OS X
Binary files for 64-bit Android* applications were added as part of the Linux* OS package.
Improvements to the Flow Graph features
§ Don’t forget to check out Flow Graph Designer
Several Improvements to examples and documentation
GFX Compiler: Offload Compiler for Intel® HD Graphics
Intel® Graphics Technology
Programming Model Features
§ Shared Virtual Memory (available in Beta update)
§ Some OpenMP* 4.0
§ Improved asynchronous programming support
Performance Improvements
§ Shared Local Memory
§ Tuned for 5th Generation Intel® Core™ processor
§ Improved vectorization for Gen target
Usability
§ gfx_sys_check tool
§ Improved debugging support
Adding (some) OpenMP* 4.0 Offload Support
    bool Sobel::execute_offload() {
        int w = COLOR_CHANNEL_NUM * image_width;  // floats per image row
        float *outp = this->output;
        float *img = this->image;
        int iw = image_width;
        int ih = image_height;

        // Offload the loop nest; the image and output buffers are mapped to the target.
        #pragma omp target map(to: ih, iw, w) \
                           map(tofrom: img[0:iw*ih*COLOR_CHANNEL_NUM], \
                                       outp[0:iw*ih*COLOR_CHANNEL_NUM])
        #pragma omp parallel for collapse(2)
        for (int i = 1; i < ih - 1; i++) {
            for (int k = COLOR_CHANNEL_NUM; k < (iw - 1) * COLOR_CHANNEL_NUM; k++) {
                // The +/- 4 offsets step one pixel left or right when COLOR_CHANNEL_NUM is 4.
                float gx = 1 * img[k + (i - 1) * w - 1 * 4]
                         + 2 * img[k + (i - 1) * w + 0 * 4]
                         + 1 * img[k + (i - 1) * w + 1 * 4]
                         - 1 * img[k + (i + 1) * w - 1 * 4]
                         - 2 * img[k + (i + 1) * w + 0 * 4]
                         - 1 * img[k + (i + 1) * w + 1 * 4];
                float gy = 1 * img[k + (i - 1) * w - 1 * 4]
                         - 1 * img[k + (i - 1) * w + 1 * 4]
                         + 2 * img[k + (i + 0) * w - 1 * 4]
                         - 2 * img[k + (i + 0) * w + 1 * 4]
                         + 1 * img[k + (i + 1) * w - 1 * 4]
                         - 1 * img[k + (i + 1) * w + 1 * 4];
                outp[i * w + k] = sqrtf(gx * gx + gy * gy) / 2.0;
            }
        }
        return true;
    }
Usability:
- Only a subset is supported
- ‘tofrom’ and ‘to’ map to ‘pin’
- -qopenmp-offload=gfx must be used to change the compiler default omp target from MIC to GFX
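The map clauses can also be tried on a much smaller loop. The following is a minimal sketch, not taken from the slides: `scale_offload` and its parameters are hypothetical names chosen for illustration. It assumes a compiler with OpenMP* 4.0 target support; when OpenMP is not enabled the pragmas are ignored and the loop simply runs on the host, so the result should be the same either way.

```cpp
#include <cassert>

// Hypothetical helper: writes factor * in[i] to out[i] for i in [0, n).
// With offload enabled, the loop is sent to the default OpenMP target device;
// otherwise the pragmas are ignored and it runs on the host.
void scale_offload(const float *in, float *out, int n, float factor) {
    assert(in != nullptr && out != nullptr && n >= 0);

    // 'to' copies the input to the device; 'tofrom' also copies the result back.
    #pragma omp target map(to: in[0:n]) map(tofrom: out[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        out[i] = factor * in[i];
    }
}
```

Calling `scale_offload(in, out, 4, 2.0f)` on the input {1, 2, 3, 4} should leave {2, 4, 6, 8} in `out`. According to the slide above, the Intel compiler would additionally need -qopenmp-offload=gfx to direct the target region to GFX instead of MIC.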
Intel® MIC Architecture Enhancements for Offloading
New Features available for C/C++
§ Offload of structures with pointer members - enabling offload inside member function
§ Offload with MIC-only memory allocation - new modifiers targetptr and preallocated
§ Offload using Streams - new stream clause and associated APIs
Performance Improvements for C/C++ and Fortran
§ Asynchronous offload
§ Memory Allocation and Data Transfers
Offload Features for Intel® Xeon Phi™
Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
New instruction set extension for next generation Intel® MIC architecture (code name Knights Landing – KNL) and future Intel® Xeon architecture
AVX-512 - Greatly increased Register File
XMM0-15: 16 bytes (SSE)
YMM0-15: 32 bytes (AVX2)
ZMM0-31: 64 bytes (AVX-512)

Vector Registers              IA32 (32-bit)   Intel64 (64-bit)
SSE (1999)                    8 x 128 bit     16 x 128 bit
AVX and AVX-2 (2011 / 2013)   8 x 256 bit     16 x 256 bit
AVX-512 (2014 – KNL)          8 x 512 bit     32 x 512 bit
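The arithmetic behind the table can be sketched as follows. The helper names (`lanes`, `register_file_bytes`) and constants are hypothetical and only restate the register widths and counts listed above; they are not part of any Intel API.

```cpp
#include <cassert>
#include <cstddef>

// Register widths in bytes, as listed in the table above.
const std::size_t kXmmBytes = 16;  // SSE
const std::size_t kYmmBytes = 32;  // AVX / AVX2
const std::size_t kZmmBytes = 64;  // AVX-512

// Number of elements of type T that fit into one register of the given width.
template <typename T>
std::size_t lanes(std::size_t register_bytes) {
    assert(register_bytes % sizeof(T) == 0);
    return register_bytes / sizeof(T);
}

// Total size in bytes of a register file with `count` registers.
std::size_t register_file_bytes(std::size_t count, std::size_t register_bytes) {
    return count * register_bytes;
}
```

On Intel64, `register_file_bytes(32, kZmmBytes)` gives 2048 bytes for AVX-512 versus 512 bytes for the 16 YMM registers, and one ZMM register holds `lanes<float>(kZmmBytes)` = 16 single-precision or 8 double-precision values.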
The Intel® AVX-512 Subsets [1]
AVX-512F: 512-bit Foundation instructions common between MIC and Xeon
§ Comprehensive vector extension for HPC and enterprise
§ All the key AVX-512 features: masking, broadcast…
§ 32-bit and 64-bit integer and floating-point instructions
§ Promotion of many AVX and AVX2 instructions to AVX-512
§ Many new instructions added to accelerate HPC workloads
AVX-512CD: Conflict Detection instructions
§ Allow vectorization of loops with possible address conflicts
§ Will show up on Xeon
AVX-512ER and AVX-512PF: extensions for exponential and prefetch operations
§ Fast (28-bit) instructions for exponential, reciprocal and transcendentals (as well as RSQRT)
§ New prefetch instructions: gather/scatter prefetches and PREFETCHWT1
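A typical loop with a possible address conflict is a histogram update, where two iterations may increment the same bin. The sketch below is plain scalar C++ with hypothetical names (`histogram`, `bin_index`); it only illustrates the loop pattern that conflict detection targets. Whether a given compiler actually vectorizes it with AVX-512CD depends on the compiler version and switches and is not guaranteed here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how often each bin index occurs. Because bin_index may contain
// duplicates, two iterations can update the same element of counts: this is
// the "possible address conflict" that blocks straightforward vectorization.
std::vector<int> histogram(const std::vector<int> &bin_index, std::size_t num_bins) {
    std::vector<int> counts(num_bins, 0);
    for (std::size_t i = 0; i < bin_index.size(); i++) {
        // Bounds check; compiled out when NDEBUG is defined.
        assert(bin_index[i] >= 0 &&
               static_cast<std::size_t>(bin_index[i]) < num_bins);
        counts[bin_index[i]]++;  // indirect read-modify-write
    }
    return counts;
}
```

For the indices {0, 2, 2, 1, 2} and three bins, the result is {1, 1, 3}; the three updates of bin 2 are the conflicting accesses.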
The Intel® AVX-512 Subsets [2]
AVX-512DQ: Double and Quad word instructions
§ All of the (packed) 32-bit/64-bit operations AVX-512F doesn’t provide
§ Close 64-bit gaps like VPMULLQ: packed 64x64 → 64
§ Extend mask architecture to word and byte (to handle vectors)
§ Packed/scalar converts of signed/unsigned to SP/DP
AVX-512BW: Byte and Word instructions
§ Extend packed (vector) instructions to byte and word (8- and 16-bit) data types
§ MMX/SSE2/AVX2 re-promoted to AVX-512 semantics
§ Mask operations extended to 32/64 bits to adapt to the number of objects in 512 bits
§ Permute architecture extended to words (VPERMW, VPERMI2W, …)
AVX-512VL: Vector Length extensions
§ Vector length orthogonality
§ Support for 128 and 256 bits instead of full 512 bits
§ Not a new instruction set but an attribute of existing 512-bit instructions
Other New Instructions
MPX: Intel® MPX – Intel® Memory Protection Extensions
§ Set of instructions to implement checking a pointer against its bounds
§ Pointer Checker support in HW (today a SW-only solution of e.g. Intel Compilers)
§ Debug and security features
SHA: Intel® SHA – Intel® Secure Hash Algorithm
§ Fast implementation of cryptographic hashing algorithm as defined by NIST FIPS PUB 180
CLFLUSHOPT: Single Instruction – Flush a cache line
§ Needed for future memory technologies
XSAVE{S,C}: Save and restore extended processor state
AVX-512 – KNL and future Xeon
§ KNL and future Xeon architecture share a large set of instructions
§ but sets are not identical
§ Subsets are represented by individual feature flags (CPUID)

Instruction sets by processor:
NHM: SSE*
SNB: SSE*, AVX
HSW: SSE*, AVX, AVX2
Future Xeon Phi (KNL): SSE*, AVX, AVX2*, AVX-512F, AVX-512CD, AVX-512ER, AVX-512PF
Future Xeon: SSE*, AVX, AVX2, AVX-512F, AVX-512CD, AVX-512BW, AVX-512DQ, AVX-512VL, MPX, SHA, …
Common Instruction Set (KNL and future Xeon): SSE*, AVX, AVX2, AVX-512F, AVX-512CD
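Because each subset has its own CPUID feature flag, software can test for them individually. The sketch below decodes the EBX register of CPUID leaf 7 (sub-leaf 0), where these flags are reported; the bit positions follow the Intel architecture documentation. The names `FeatureBit`, `kLeaf7Ebx` and `decode_leaf7_ebx` are hypothetical, and the code deliberately takes the register value as a parameter instead of executing CPUID, so it stays portable.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One feature flag: display name and bit position in CPUID.(EAX=7,ECX=0):EBX.
struct FeatureBit {
    const char *name;
    unsigned bit;
};

static const FeatureBit kLeaf7Ebx[] = {
    {"MPX", 14},        {"AVX-512F", 16},  {"AVX-512DQ", 17}, {"CLFLUSHOPT", 23},
    {"AVX-512PF", 26},  {"AVX-512ER", 27}, {"AVX-512CD", 28}, {"SHA", 29},
    {"AVX-512BW", 30},  {"AVX-512VL", 31},
};

// Returns the names of all features whose bit is set in the given EBX value,
// in the order of the table above.
std::vector<std::string> decode_leaf7_ebx(std::uint32_t ebx) {
    std::vector<std::string> found;
    for (const FeatureBit &f : kLeaf7Ebx) {
        assert(f.bit < 32);
        if ((ebx >> f.bit) & 1u) {
            found.push_back(f.name);
        }
    }
    return found;
}
```

An EBX value with bits 16, 26, 27 and 28 set decodes to AVX-512F, AVX-512PF, AVX-512ER and AVX-512CD, which matches the KNL list above. Note that a set CPUID bit alone does not show that the operating system saves the wider register state; production code would also need to check that.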
Intel® Compiler Processor Switches
Switch              Description
-xmic-avx512        KNL only; already in 14.0
-xcore-avx512       Future Xeon only; already in 15.0.1
-xcommon-avx512     AVX-512 subset common to both; already in 15.0.2
-m, -march, /arch   Not yet!
-ax<…-avx512>       Same as for “-x<…-avx512>”
-mmic               No – not for KNL
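One way to see which subsets a given switch enabled is to inspect the compiler's predefined macros at build time. The sketch below assumes the compiler predefines `__AVX512F__`, `__AVX512CD__` and so on when the corresponding subset is enabled, as current Intel, GCC and Clang compilers do; the function name `enabled_avx512_subsets` is hypothetical. Without any AVX-512 switch it returns an empty string.

```cpp
#include <cassert>
#include <string>

// Returns a space-terminated list of the AVX-512 subsets the compiler was
// allowed to use for this translation unit, based on predefined macros.
std::string enabled_avx512_subsets() {
    std::string s;
#ifdef __AVX512F__
    s += "F ";
#endif
#ifdef __AVX512CD__
    s += "CD ";
#endif
#ifdef __AVX512ER__
    s += "ER ";
#endif
#ifdef __AVX512PF__
    s += "PF ";
#endif
#ifdef __AVX512BW__
    s += "BW ";
#endif
#ifdef __AVX512DQ__
    s += "DQ ";
#endif
#ifdef __AVX512VL__
    s += "VL ";
#endif
    return s;
}
```

Under the assumption above, a build with -xcommon-avx512 would be expected to report "F CD ", while the KNL-only and Xeon-only switches would add their respective subsets.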
Memory Model of Next Generation Intel® MIC Architecture, Code Name Knights Landing (KNL)
KNL Memory Modes
High Bandwidth On-Chip Memory API
• API is open-sourced (BSD licenses)
  • https://github.com/memkind
• Uses jemalloc API underneath
  • http://www.canonware.com/jemalloc/
  • https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919
Malloc replacement:
    #include <memkind.h>
    hbw_check_available()
    hbw_malloc, _calloc, _realloc, … (memkind_t kind, …)
    hbw_free()
    hbw_posix_memalign()
    hbw_get_size(), _psize()
    ld … -ljemalloc -lnuma -lmemkind -lpthread
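A minimal sketch of how the malloc replacement might be used with a fallback to ordinary memory follows. Several things here are assumptions rather than statements from the slides: `USE_HBWMALLOC` is a hypothetical build macro, `Buffer`, `allocate_buffer` and `release_buffer` are hypothetical names, and the `hbw_*` functions are taken to be declared in `hbwmalloc.h` as in the memkind releases known to the editor. Without the macro, the code uses only the standard allocator and needs no extra libraries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#ifdef USE_HBWMALLOC          // hypothetical opt-in; also link with -lmemkind
#include <hbwmalloc.h>
#endif

// Remembers which allocator produced the block so it is freed correctly.
struct Buffer {
    void *data;
    std::size_t bytes;
    bool high_bandwidth;      // true if data came from hbw_malloc
};

Buffer allocate_buffer(std::size_t bytes) {
    assert(bytes > 0);
    Buffer b = {nullptr, bytes, false};
#ifdef USE_HBWMALLOC
    if (hbw_check_available() == 0) {   // 0: high-bandwidth memory is usable
        b.data = hbw_malloc(bytes);
        b.high_bandwidth = (b.data != nullptr);
    }
#endif
    if (b.data == nullptr) {            // fall back to ordinary memory
        b.data = std::malloc(bytes);
    }
    return b;
}

void release_buffer(Buffer &b) {
#ifdef USE_HBWMALLOC
    if (b.high_bandwidth) {
        hbw_free(b.data);
        b.data = nullptr;
        b.high_bandwidth = false;
        return;
    }
#endif
    std::free(b.data);
    b.data = nullptr;
}
```

Keeping the `high_bandwidth` flag next to the pointer avoids passing a block from one allocator to the other allocator's free routine, which is the main pitfall when mixing the two.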
Debugging: Intel Extended GNU Debugger (GDB-IA)
Debugging Enhancements
Enhancements to the Intel® version of GDB, the GNU* Project Debugger (Linux* only)
§ Improved OpenMP* support (tasks, task dependencies, teams & barriers)
§ Added Fortran intrinsic support (e.g. ASSOCIATED, ALLOCATED, UBOUND, …)
Improved debugging support for Intel® Graphics Technology
Summary / Call to Action
Next Steps
Register and Try our Beta Release
§ Visit: bit.ly/psxe2016beta
Send Feedback!
§ Report issues via Intel Premier - https://premier.intel.com/
§ Please participate in our Beta Surveys, we value all comments!
Remember: New 2016 versions of all Intel® Parallel Studio XE tools – not only compiler and libraries!
Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804