Expressing vectorization Milind Girkar Intel Corp.
Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Intel’s Multi-Core and Many Integrated Core Products
The Goals of Intel® Architecture are to deliver:
• Industry leading performance/watt for serial & highly parallel workloads
• Optimized Efficiency for a Heterogeneous Solution in combination with Intel® Xeon® processors
• Complete set of software tools to deploy scalable solutions efficiently
Many Core Intel® Xeon® Phi™ coprocessor at 1-1.2 GHz
Multi-core Intel® Xeon® processor at 2.26-3.5 GHz
10/3/2013
Overview
Multiple Levels of Parallelism
SIMD
Node: Threading, and Threading Help
Mixing CPU and Intel® Xeon Phi™
Cluster Analysis
Hotspot analysis results
CPU HW Sampling results
Source View
Intel® Inspector XE 2011 Memory and Thread Analysis
Memory Analysis finds
• Memory leaks and memory corruption
• Memory allocation & de-allocation API mismatches
• Inconsistent memory API usage
Thread Analysis finds
• Data races and deadlocks
• Thread and sync APIs used
• Latent bugs within increasingly complex parallel programs
• Stack memory accesses by another thread
Locks analysis type - results
• Sorted list of synch objects causing the most thread wait time
• Clicking on a synch object displays the source code for the acquisition of that object
Intel® Advanced Vector Extensions
[Figure: roadmap illustration, subject to change. Performance per core by year, 2010 to 2013.]
• Since 2001: 128-bit vectors
• Sandy Bridge (32 nm Tock): AVX 1.0, 2X flops, 256-bit wide floating-point vectors
• Ivy Bridge (22 nm Tick): half-float support, random numbers
• Haswell (22 nm Tock): AVX2, FMA (2x peak flops), 256-bit integer SIMD, "Gather" instructions
• Knights Landing / future Xeon: 512-bit vectors, 32 registers, masking, broadcast
• Goal: 8X peak FLOPs over 4 generations
Intel® AVX Technology
256b AVX1 (SNB 2011)
• 256-bit basic FP
• 16 registers
• NDS (and AVX128)
• Improved blend
• MASKMOV
• Implicit unaligned
• Float16 (IVB 2012)
256b AVX2 (HSW 2013)
• 256-bit FP FMA
• 256-bit integer
• PERMD
• Gather
512b AVX-512 (future processors; in planning, subject to change)
• 512-bit FP/Integer
• 32 registers
• 8 mask registers
• Gather/Scatter
• Embedded rounding
• Embedded broadcast
• Scalar/SSE/AVX "promotions"
• HPC additions
• Transcendental support
• Gather/Scatter
Vectorization
Conversion of serial code into SIMD instructions that simultaneously operate on multiple data elements.

    float a[N], b[N]; int i;
    for (i = 0; i < N; i++)
        a[i] = a[i] + b[i];

Scalar code, one element per iteration:

    loop: movss  xmm0, _a[eax]
          addss  xmm0, _b[eax]
          movss  _a[eax], xmm0
          add    eax, 4
          cmp    eax, ecx
          jl     loop

Vector code, four elements per iteration:

    loop: movaps xmm0, _a[eax]
          addps  xmm0, _b[eax]
          movaps _a[eax], xmm0
          add    eax, 16
          cmp    eax, ecx
          jl     loop
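The same scalar-to-vector step can be written by hand in C, which may help show what the compiler is doing in the assembly above. The sketch below is illustrative only: the function names add_scalar and add_vector are hypothetical, and it assumes an x86 target where the SSE header <xmmintrin.h> is available (it falls back to the scalar loop elsewhere). It uses unaligned loads, whereas the slide's movaps requires 16-byte aligned data.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Scalar version: one element per iteration (compare movss/addss). */
void add_scalar(float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}

/* Hand-vectorized version: four floats per iteration (compare addps),
   followed by a scalar loop for the leftover 0 to 3 elements. */
void add_vector(float *a, const float *b, size_t n) {
    assert(n == 0 || (a != NULL && b != NULL));
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);          /* load a[i..i+3] */
        __m128 vb = _mm_loadu_ps(b + i);          /* load b[i..i+3] */
        _mm_storeu_ps(a + i, _mm_add_ps(va, vb)); /* a[i..i+3] += b[i..i+3] */
    }
#endif
    for (; i < n; i++)                            /* remainder loop */
        a[i] = a[i] + b[i];
}
```

Both functions should produce identical results for any n; the remainder loop is what handles trip counts that are not a multiple of four.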
Vectorization Directives
Data dependence hints
• “restrict” keyword
• #pragma ivdep
Loop count hints
• #pragma loop_count (<int>)
• #pragma loop_count min(<int>), max(<int>), avg(<int>)
Heuristic-related hints
• #pragma novector
• #pragma vector always
Data alignment directive
• C/C++
– Windows: __declspec(align(16)) float A[1000];
– Linux/MacOS: float A[1000] __attribute__((aligned(16)));
• Fortran
– !DIR$ ATTRIBUTES ALIGN: 16:: A
Data alignment assertion (16B example)
• C/C++: __assume_aligned(p,16);
• Fortran: !DIR$ ASSUME_ALIGNED A(1):16
Aligned loop assertion
• C/C++: #pragma vector aligned
• Fortran: !DIR$ VECTOR ALIGNED
Aligned malloc
• _aligned_malloc()
• _mm_malloc()
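As a rough illustration of how a few of these hints combine, the sketch below uses the restrict keyword together with the Linux/MacOS alignment attribute. The names scale_add and is_aligned16 are hypothetical. The Intel-specific __assume_aligned and #pragma vector aligned are guarded so that the file also builds with GCC or Clang, where they are simply omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 1000

/* GCC/Clang spelling of the data alignment directive (16-byte boundary). */
static float A[N] __attribute__((aligned(16)));
static float B[N] __attribute__((aligned(16)));

/* Returns nonzero if p lies on a 16-byte boundary. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}

/* restrict tells the compiler that dst and src never overlap, which
   removes the assumed dependence between the two arrays. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n) {
    assert(is_aligned16(dst) && is_aligned16(src));
#if defined(__INTEL_COMPILER)
    __assume_aligned(dst, 16);   /* data alignment assertion */
    __assume_aligned(src, 16);
#pragma vector aligned           /* aligned loop assertion */
#endif
    for (size_t i = 0; i < n; i++)
        dst[i] = dst[i] + k * src[i];
}
```

The runtime assert is only a debugging aid for this sketch; the alignment assertions themselves are promises to the compiler, and breaking them is undefined behavior.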
Motivation for Explicit Vectorization
• Most users do not rely on automatic threading/vectorization
• Subject to limits of compiler analysis capabilities…
• Most frequent reason: Dependence issues
• Many other potential reasons
  – Function calls in loop block
  – Complex control flow / conditional branches
  – Loop not "countable"
  – Not inner loop
  – Loop body too complex
  – Vectorization seems inefficient
• For threading, most users have moved to explicit expression
• OpenMP*, Intel® Cilk™, Intel® Threading Building Blocks, others
• So, Intel started the Intel® Cilk™ Plus effort
• Some parts now standardized into OpenMP 4.0
*OpenMP is a trademark of the OpenMP Architecture Review Board.
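To make "dependence issues" concrete, the following sketch contrasts a loop whose iterations are independent with one that carries a dependence from each iteration to the next. The function names are hypothetical and the example is not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: a[i] depends only on b[i] and c[i], so several
   iterations can run at once, provided the compiler can also prove (or is
   told) that a does not overlap b or c. */
void independent(float *a, const float *b, const float *c, size_t n) {
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Loop-carried dependence: iteration i reads the value that iteration
   i-1 just wrote (a running sum), so the iterations cannot simply be
   executed four at a time. */
void prefix_sum(float *a, size_t n) {
    for (size_t i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}
```

In the first loop the only obstacle is possible pointer aliasing, which is the kind of case that hints or explicit vectorization address. In the second loop the dependence is real, and asserting otherwise would produce wrong results.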
Explicit vectorization
#pragma simd
for (int k=0; k<foo(); k++) {
int t; // private
t = a[k] + b[k];
c[k] = t;
while () { …}
}
c[0:n] = a[0:n] + b[0:n];
• Treat it like a countable loop
• Local variables considered private.
• Reductions etc.
• Other structured control flow constructs
• Fortran array sections in C/C++
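The points about private locals and reductions can be sketched with the OpenMP 4.0 spelling, #pragma omp simd, in place of the Cilk Plus #pragma simd shown above. This is a minimal sketch; the function name dot is hypothetical. A compiler without OpenMP SIMD support enabled will ignore the pragma and run the loop as ordinary scalar code.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product as an explicitly vectorized reduction. */
float dot(const float *a, const float *b, int n) {
    assert(n == 0 || (a != NULL && b != NULL));
    float sum = 0.0f;
#pragma omp simd reduction(+:sum)
    for (int k = 0; k < n; k++) {
        float t = a[k] * b[k];  /* declared in the loop body: private per lane */
        sum += t;               /* combined across lanes by the reduction clause */
    }
    return sum;
}
```

Because the reduction may reassociate the additions, floating-point results can differ slightly from the scalar order for general inputs.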
Intel Cilk™ Plus Language extensions
• SIMD functions
• Array Notation
• SIMD loops
• Cilk keywords
SIMD functions
    //callee.c
    __declspec(vector)          // CilkPlus
    #pragma omp declare simd    // OpenMP 4.0
    float myfunc(float x, float y) {
        return x*x + y*y;
    }
    //SIMD version created by the compiler
    // __m128 == float4
    __m128 myfunc$vec(__m128 vx, __m128 vy) {
        return vx*vx + vy*vy;
    }
Using SIMD functions
    // caller.c
    // caller knows that vector version of myfunc exists
    __declspec(vector) float myfunc(float x, float y);   // SIMD function prototype
    extern float c[N], a[N], b[N];

    void caller1(int n) {
        c[0:n] = myfunc(a[0:n], b[0:n]);                 // Array Notation
    }

    void caller2(int n) {
        #pragma simd                                     // SIMD loops
        for (int k=0; k<n; k++)
            c[k] = myfunc(a[k], b[k]);
    }

    void caller3(int n) {
        _Cilk_for (int k = 0; k < n ; k++) {             // Cilk keywords
            c[k] = myfunc(a[k],b[k]);
        }
    }
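For compilers without Cilk Plus support, the SIMD-loop caller can be approximated with the OpenMP 4.0 constructs alone. The sketch below is an assumption-laden variant and not the slide's code: it passes the arrays as parameters instead of using the extern globals, and the name caller_omp is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* OpenMP 4.0 spelling: ask the compiler to also generate a vector
   version of myfunc alongside the scalar one. */
#pragma omp declare simd
float myfunc(float x, float y) {
    return x * x + y * y;
}

/* SIMD loop that calls the function; with OpenMP SIMD enabled the
   compiler may call the vector version, several elements per call. */
void caller_omp(float *c, const float *a, const float *b, int n) {
    assert(n == 0 || (c != NULL && a != NULL && b != NULL));
#pragma omp simd
    for (int k = 0; k < n; k++)
        c[k] = myfunc(a[k], b[k]);
}
```

Whether the vector variant is actually used depends on the compiler and its flags; without them the pragmas are ignored and the code runs as plain scalar C.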
Supporting Xeon Phi as coprocessor
• Intel pragmas for offload supported in Intel Parallel Studio XE 2013 for Linux
• Intel Compiler 14.0 supports the recently standardized OpenMP* 4.0 device constructs
• OpenMP 4.0, Intel offload pragmas are broadly equivalent
*The OpenMP name and the OpenMP logo are registered trademarks of the OpenMP Architecture Review Board.
OpenMP constructs have equivalent Intel Pragma constructs

Topic | OpenMP 4.0 Target | Intel Pragma
Data placement
  data clauses | map(to, from, tofrom, alloc) | in/out/inout, alloc_if, free_if, nocopy
  data construct | target data | offload_transfer
  update construct | target update | offload_transfer
  declare target directive | declare target | __declspec(target(mic))
  free/alloc API routines | malloc/free | malloc/free
Code placement
  Execution model | host-centric, device executes the region | host-centric, device executes the region
  Offloading region execution | target + [teams] | target
  Offloading loop/for construct | target + team + distribute / parallel for / for | No restriction
  Declare target call construct | declare target function | __declspec(target(mic))
Multiple device support of same type
  device clause | device clause, ICV | device clause
  API routines | get/set dev num | Device clause/env
Asynchronous/Synchronous control | re-use tasking and add task dependency (in | out | inout); thread waits on device @ task scheduling point | Signal/wait
array sections | array sections | array sections
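The data-clause row can be illustrated with a small OpenMP 4.0 target region. This is a sketch under assumptions: the function name vec_scale is hypothetical, and the Intel offload line in the comment is shown only for comparison. When no device is available, or when OpenMP is not enabled, the loop simply runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Scale n floats on the device.
   map(to: ...)   copies the input array section to the device;
   map(from: ...) copies the output array section back to the host.
   A broadly equivalent Intel pragma would be along the lines of
     #pragma offload target(mic) in(in:length(n)) out(out:length(n)) */
void vec_scale(float *out, const float *in, float k, int n) {
    assert(n == 0 || (out != NULL && in != NULL));
#pragma omp target map(to: in[0:n]) map(from: out[0:n])
#pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = k * in[i];
}
```

The in[0:n] and out[0:n] expressions are the array sections listed in the last row of the table: a start index and a length.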
Example – Mandelbrot
    #pragma omp declare target
    #pragma omp declare simd simdlen(16)
    uint32_t mandel(fcomplex c)
    {
        // Computes number of iterations (count variable) that it takes
        // for parameter c to be known to be outside mandelbrot set
        uint32_t count = 1; fcomplex z = c;
        for (int32_t i = 0; i < max_iter; i += 1) {
            z = z * z + c;
            int t = cabsf(z) < 2.0f;
            count += t;
            if (!t) { break; }
        }
        return count;
    }
    #pragma omp end declare target

    #pragma omp target device(0) map(to:in_vals) map(from:count)
    #pragma omp parallel for schedule(guided)
    for (int32_t y = 0; y < ImageHeight; ++y) {
        #pragma omp simd
        for (int32_t x = 0; x < ImageWidth; ++x) {
            count[y][x] = mandel(in_vals[y][x]);
        }
    }
Performance with Intel Cilk Plus
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. See slide 22 for configurations and benchmark results disclaimer.
22
Results have been measured by Intel based on software, benchmark or other data of third parties and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Intel does not control or audit the design or implementation of third party data referenced in this document. Intel encourages all of its customers to visit the websites of the referenced third parties or other sources to confirm whether the referenced data is accurate and reflects performance of systems available for purchase. Configuration: Intel® Core™ i7 CPU X980 system (6 cores with Hyper-Threading On), running at 3.33GHz, with 4.0GB RAM, 12M smart cache, 64-bit Windows Server 2008 R2 Enterprise SP1. For more information go to http://www.intel.com/performance
Working with the Standards Community
CilkPlus Extensions
gcc, llvm in the works
Open sourced the cilk runtime
Proposed to C++, WG formed
ISO/IEC JTC1/SC22/WG21; INCITS/PL22.16 (C++)
INCITS/PL22.3 (Fortran)
OpenMP 4.0
Summary
• Efficient performance through vectorization/parallelism is beyond what can be done automatically
• Explicit vectorization achieves predictable vectorization
• Similar to what OpenMP does for parallelization
• Working towards standardization and supporting standardized parallelism/vectorization/offload constructs
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. 
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm Copyright© 2013, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
Legal Disclaimer & Optimization Notice