Preparing Codes for Intel Knights Landing (KNL): Getting ready for the next generation Intel Xeon Phi processors

Jan 11, 2017

Technology

AllineaSoftware
Transcript
Page 1: Preparing Codes for Intel Knights Landing (KNL)

Preparing Codes for Intel Knights Landing (KNL)

Getting ready for the next generation Intel Xeon Phi processors

Page 2: Preparing Codes for Intel Knights Landing (KNL)

Hey, at least it compiles

Optimized for modern CPUs

An Uncomfortable Truth about Software

Page 3: Preparing Codes for Intel Knights Landing (KNL)

The upcoming Intel Knights Landing Platform

• 72 cores

– 36 tiles in a 2D mesh

– 16GB MCDRAM (HBM) total

• Each tile:

– 2 cores

– 2 AVX-512 vector processing units per core

– 1MB L2 cache

• More info:

– http://www.anandtech.com/show/9802/supercomputing-15-intels-knights-landing-xeon-phi-silicon-on-display

• Extracting performance means:

– Using better memory access patterns

– Using more threads

– Using vectorization

Page 4: Preparing Codes for Intel Knights Landing (KNL)

Flow for scaling up: Knights Landing codes

1. Analyze a realistic run with Allinea Performance Reports

• Fix any obvious issues!

• Then pursue steps 2 and 3

2. Is memory access dominating?

• Improve cache usage (bandwidth)

• Increase thread count (latency)

3. Scalar numeric ops dominating?

• Go for vectorization!
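
The flow above can be read as a small decision procedure. The following is a minimal sketch in C, assuming the CPU time breakdown from a report is available as fractions. The struct, enum and function names are hypothetical and not part of any Allinea tool, and "dominating" is interpreted here simply as "the largest category".

```c
#include <assert.h>

/* Fractions of wallclock time, each in [0, 1]. Field names are
   illustrative only. */
typedef struct {
    double memory;  /* time in memory accesses */
    double scalar;  /* time in scalar numeric operations */
    double vector;  /* time in vector (AVX...) operations */
} cpu_breakdown;

typedef enum {
    STEP_MEMORY,        /* step 2: work on memory access first */
    STEP_VECTORIZE,     /* step 3: go for vectorization */
    STEP_NONE_OBVIOUS   /* neither memory nor scalar ops dominate */
} next_step;

/* Memory is checked first because step 2 comes before step 3 in the flow. */
next_step choose_next_step(cpu_breakdown b)
{
    assert(b.memory >= 0.0 && b.scalar >= 0.0 && b.vector >= 0.0);
    if (b.memory >= b.scalar && b.memory >= b.vector)
        return STEP_MEMORY;
    if (b.scalar > b.vector)
        return STEP_VECTORIZE;
    return STEP_NONE_OBVIOUS;
}
```

For example, a run that spends 60% of its time in memory accesses would be sent to step 2 before any vectorization work is attempted.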

Page 5: Preparing Codes for Intel Knights Landing (KNL)

Figure labels: out-of-order; pipelined; time per retired instruction

The performance of processors is complex

• …so it’s important to communicate what is happening at the user’s abstraction level

Page 6: Preparing Codes for Intel Knights Landing (KNL)

Step 1: Obtain a performance overview

• Allinea Performance Reports runs on Linux applications

• It identifies how well they are using the processor

• “I didn’t know I wasn’t using vectorization!”

Page 7: Preparing Codes for Intel Knights Landing (KNL)

Performance Reports are not just for vectorization, but for MPI, I/O, memory and energy usage too

Page 8: Preparing Codes for Intel Knights Landing (KNL)

Step 1: Focus on CPU performance report section

Statistical wallclock time estimate of:

• Scalar numeric operations

• AVX/AVX2/… operations

• Memory accesses

• Other (branch, logic, …)

Plus simple, actionable advice

Page 9: Preparing Codes for Intel Knights Landing (KNL)

Step 2: If memory access is dominating

• This has to be solved before going further!

– The processor is thirsty for data – help get it there faster!

– Visualize the memory access patterns

• Many strategies can improve access patterns

– Blocking – try to keep data in the cache by re-ordering into 2-d, 3-d blocks

– Latency hiding – use more threads than you have cores

– Mixing MPI and OpenMP – partitioning into processes can enable better memory access patterns

• KNL helps the memory situation…

– But it’s limited to 16GB – and that’s not enough to replace DIMMs

– Explicitly use HBM for key arrays to reduce the cost of cache misses
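
As an illustration of the blocking strategy listed above, here is a minimal sketch in C of a matrix transpose that works on one small 2-d tile at a time, so that the lines touched in both arrays have a chance to stay in cache. The function name and the tile size `BLOCK` are assumptions chosen for the example; the right tile size depends on the cache and should be measured.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 32  /* assumed tile edge, in elements; tune for the target cache */

/* Transpose the n x n row-major matrix src into dst, one BLOCK x BLOCK
   tile at a time. Out-of-place only: src and dst must not alias. */
void transpose_blocked(size_t n, const double *src, double *dst)
{
    assert(src != dst);
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            /* Clamp the tile at the matrix edge when n is not a
               multiple of BLOCK. */
            size_t i_end = ii + BLOCK < n ? ii + BLOCK : n;
            size_t j_end = jj + BLOCK < n ? jj + BLOCK : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

The unblocked version walks one of the two arrays with a stride of `n` elements for the whole matrix; the tiled version limits that strided walk to `BLOCK` rows at a time.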

Page 10: Preparing Codes for Intel Knights Landing (KNL)

Knights Landing High Bandwidth Memory detail

• High Bandwidth Memory: 16GB of on-package high-bandwidth MCDRAM

• 3 possible modes, set at boot time: Cache, Flat (NUMA), Hybrid

Page 11: Preparing Codes for Intel Knights Landing (KNL)

Step 2: Our tools that help improve memory access

• Allinea Performance Reports

– Follow guidance to improve usage without changing code: experiment with more/fewer threads

• Allinea Forge (developer tools)

– Allinea MAP – take a real workload and find the area of code that dominates

• Rework access patterns that are in the dominating loops

– Allinea DDT – debugger will track memory allocations and leaks within HBM

• Ensure this memory is not wasted

Page 12: Preparing Codes for Intel Knights Landing (KNL)

Step 2: Use Allinea MAP – profile the code for a real run

• Small data files

• Just <5% slowdown

• No instrumentation

• No recompilation

Page 13: Preparing Codes for Intel Knights Landing (KNL)

How Allinea MAP is different

• Adaptive sampling

– Sample frequency decreases over time

– Data never grows too much

– Run for as long as you want

• Scalable

– Same scalable infrastructure as Allinea DDT

– Merges sample data at end of job

– Handles very high core counts, fast

• Instruction analysis

– Categorizes instructions sampled

– Knows where processor spends time

– Shows vectorization and memory bandwidth

• Thread profiling

– Core-time not thread-time profiling

– Identifies lost compute time

– Detects OpenMP issues

• Integrated

– Part of Forge tool suite

– Zoom and drill into profile

– Profiling within your code
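
To make the idea of adaptive sampling concrete, here is a minimal sketch in C of one way a sampler can lower its frequency over time so that stored data stays bounded. This is not Allinea MAP's implementation, which the slides do not describe; the `sampler` type, the function names and the tiny `CAPACITY` are all hypothetical. When the buffer fills, every other sample is dropped and the sampling interval doubles.

```c
#include <assert.h>
#include <stddef.h>

#define CAPACITY 8  /* tiny on purpose; a real profiler would keep far more */

typedef struct {
    double   samples[CAPACITY];
    size_t   count;     /* samples currently stored */
    unsigned interval;  /* record one sample every `interval` ticks */
    unsigned tick;      /* ticks seen since the run started */
} sampler;

void sampler_init(sampler *s)
{
    s->count = 0;
    s->interval = 1;
    s->tick = 0;
}

/* Offer one measurement per tick; only some ticks are recorded. */
void sampler_tick(sampler *s, double value)
{
    assert(s->interval > 0);
    if (s->tick++ % s->interval != 0)
        return;  /* not a sampling tick */
    if (s->count == CAPACITY) {
        /* Buffer full: keep every other sample and halve the rate, so
           the stored samples stay evenly spaced over the whole run. */
        for (size_t i = 0; i < CAPACITY / 2; i++)
            s->samples[i] = s->samples[2 * i];
        s->count = CAPACITY / 2;
        s->interval *= 2;
    }
    s->samples[s->count++] = value;
}
```

However long the run, the buffer never holds more than `CAPACITY` samples, which is the property the slide summarizes as "data never grows too much".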

Page 14: Preparing Codes for Intel Knights Landing (KNL)

Above all…

• Aimed at any performance problem that matters

– MAP focuses on time

• Does not prejudge the problem

– Doesn’t assume it’s MPI messages, threads or I/O

• If there’s a problem…

– MAP shows you it, next to your code

• Fix directly – or target with the right follow up tool/activity

– Intel VTune, Vectorization Advisor, icc…

Page 15: Preparing Codes for Intel Knights Landing (KNL)

Step 2: Improve thread usage

• Intel Knights Landing needs lots of threads for performance: OpenMP is a common way to do this

– Even if you already use OpenMP, you may end up running more threads than the code was designed for

• Profile what actually happens: use Allinea MAP

– Common error: using OpenMP for every outer loop – including where one thread would be faster

• This example (above) shows a dark gray area of thread inactivity for OpenMP code

– OpenMP threads are stalled sharing a small amount of work

– Removing OpenMP actually improves the performance!
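
Removing the pragma is one fix; when the same loop is sometimes large and sometimes small, OpenMP's `if` clause is another. The following minimal sketch in C runs the loop on a single thread unless there is enough work to share. The function name and `PARALLEL_THRESHOLD` are assumptions for illustration, and the cut-off should come from profiling rather than from this example.

```c
#include <assert.h>
#include <stddef.h>

#define PARALLEL_THRESHOLD 100000  /* assumed cut-off; measure to find yours */

/* Sum of squares of x[0..n-1]. */
double sum_of_squares(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double total = 0.0;
    /* The if(...) clause makes the region execute with one thread when n
       is small, avoiding many threads stalling over very little work. */
    #pragma omp parallel for reduction(+:total) if(n > PARALLEL_THRESHOLD)
    for (long i = 0; i < (long)n; i++)
        total += x[i] * x[i];
    return total;
}
```

Built without OpenMP support, the pragma is ignored and the loop simply runs serially, so the function gives the same result either way.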

Page 16: Preparing Codes for Intel Knights Landing (KNL)

Real optimization of OpenMP regions

• NB – still profiling for first 300 seconds only

• Significant inactivity in final 60 seconds

• OpenMP region

– #pragma omp parallel for

• Let’s remove it

Page 17: Preparing Codes for Intel Knights Landing (KNL)

Step 3: Going for Vectorization

• Use Allinea MAP to find the key loops (illustrated)

• Use Intel Vectorization Advisor for those loops

• Use output from the compiler

• Build a mental model of the data access (again!)
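
As a small example of a loop written with vectorization in mind, here is a minimal sketch in C. It assumes a simple `y = a*x + y` kernel with unit-stride access; the function name is illustrative. `restrict` tells the compiler the arrays do not overlap, and `#pragma omp simd` asks it to vectorize the loop.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n).
   x and y must not overlap; that promise (restrict) removes the aliasing
   doubt that often stops a compiler from vectorizing. */
void axpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    assert(x != y);
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Whether the loop was actually vectorized is best confirmed from the compiler's own report (for example, GCC's `-fopt-info-vec` family of options) rather than assumed. With GCC the pragma takes effect under `-fopenmp` or `-fopenmp-simd`; otherwise it is ignored and the compiler's auto-vectorizer decides on its own.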

Page 18: Preparing Codes for Intel Knights Landing (KNL)

Lessons learned in practice

• Real codes exhibit many different performance patterns

– Profiling real data sets at real scales is vital to target the effort

• Changing code can be simple

– Use threads wisely – it will not always be faster

– Look for other libraries – someone else might have fixed your problem

• Re-engineering is necessary

– Take advantage of vector units

– Take advantage of threads

– Take advantage of HBM

Page 19: Preparing Codes for Intel Knights Landing (KNL)

Our Intel® Xeon Phi™ Knights Landing Support

Debug

• First-class Intel® Xeon Phi™ support

• Memory debugging enhancements for HBM

Tune and Analyze

• First-class Intel® Xeon Phi™ support

• Additional Intel® Xeon Phi™ metrics – watch this space!

Profile

• First-class Intel® Xeon Phi™ support

• Additional Intel® Xeon Phi™ metrics – watch this space!

Page 20: Preparing Codes for Intel Knights Landing (KNL)

Increase the output of your system

Analyze and tune with Allinea Performance Reports

Develop, profile and debug applications with Allinea Forge

With professional support when you need it most