Talk given by Leonardo Borges at the Intel Software Conference on August 6 (NCC/UNESP/SP) and August 12 (COPPE/UFRJ/RJ).
and/or other countries. *Other names and brands may be claimed as the property of others.
Intel Many Integrated Core (MIC, pronounced “Mike”)
Product Family/Architecture for Highly Parallel Applications
• Based on a large number of smaller, low-power Intel Architecture cores
• 512-bit wide vector engine
• Complements the Intel Xeon processor product line
• Provides breakthrough performance for highly parallel apps
– Familiar x86 programming model
– Same source code supports both Intel Xeon processor & Intel Xeon Phi coprocessor
– Initially a coprocessor with PCI Express form factor
First products announced at SC12: Code named Knights Corner (KNC)
• Up to 61 cores, 4 threads per core
• Up to 16GB GDDR5 memory (up to 352 GB/s)
• 225-300W (Cooling: Both passive & active SKUs)
• x16 PCIe Form-Factor (requires IA host)
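As a rough illustration of how much parallelism these figures imply, the short sketch below multiplies out the numbers on the slide (61 cores, 4 threads per core, 512-bit vectors). The function names are hypothetical, and the 32-bit and 64-bit element widths are the standard single- and double-precision sizes rather than values stated on the slide.

```python
# Figures from the slide: up to 61 cores, 4 threads per core, 512-bit vectors.
CORES = 61
THREADS_PER_CORE = 4
VECTOR_BITS = 512


def hardware_threads(cores=CORES, threads_per_core=THREADS_PER_CORE):
    """Total hardware threads on one coprocessor."""
    return cores * threads_per_core


def vector_lanes(element_bits, vector_bits=VECTOR_BITS):
    """Number of elements of the given width that fit in one vector register."""
    return vector_bits // element_bits


if __name__ == "__main__":
    print("hardware threads:", hardware_threads())        # 244
    print("single-precision lanes:", vector_lanes(32))    # 16
    print("double-precision lanes:", vector_lanes(64))    # 8
```

Code has to expose both kinds of parallelism, many threads and wide vectors, to approach the hardware's potential, which is why the later slides stress parallelizing and vectorizing together.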
Intel® Xeon Phi™ Product Family
Based on the Intel MIC Architecture
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific
computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you
in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Source: Intel Measured results as of October 26, 2012 Configuration Details: Please reference slide speaker notes.
For more information go to http://www.intel.com/performance
Notes:
1. 2 X Intel® Xeon® Processor E5-2670 (2.6GHz, 8C, 115W)
2. Intel® Xeon Phi™ coprocessor 5110P (ECC on) with Gold RC SW stack (Coprocessor power only)
Higher is Better
Coprocessor results: Benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native)
High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
Intel Xeon Phi Case Studies
Intel Xeon Phi Ecosystem
Conclusions & References
INTEL CONFIDENTIAL
Parallelizing for High Performance
Example: A Two-Step Process with SAXPY
• STARTING POINT: Unoptimized serial code running on multi-core Intel® Xeon® processors. Current performance: 67.097 seconds
• STEP 1. OPTIMIZE CODE: Parallelize and vectorize code and continue to run on multi-core Intel Xeon processors. 0.46 seconds (145X faster)
• STEP 2. USE COPROCESSORS: Run all or part of the optimized code on Intel® Xeon Phi™ coprocessors. 0.197 seconds (2.3X faster than Step 1, 340X faster than the starting point)
SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2012
The Following Performance Results are Based on Already Optimized Code
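For readers unfamiliar with the kernel, SAXPY computes y = a*x + y element by element. The sketch below gives a plain serial reference version and then reproduces the speed-up figures on the SAXPY slide from its three timings. It is only an illustration: the function and constant names are hypothetical, and the code that Intel actually measured is not shown in the slides.

```python
def saxpy(a, x, y):
    """Serial reference SAXPY: return a*x + y, element by element."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


def speedup(t_before, t_after):
    """How many times faster the 'after' run is than the 'before' run."""
    return t_before / t_after


# Timings in seconds, as reported on the slide.
T_SERIAL = 67.097       # starting point: unoptimized serial code
T_OPTIMIZED = 0.46      # step 1: parallelized and vectorized on Xeon
T_COPROCESSOR = 0.197   # step 2: optimized code on Xeon Phi

if __name__ == "__main__":
    print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))      # [12.0, 24.0, 36.0]
    print(f"step 1:  {speedup(T_SERIAL, T_OPTIMIZED):.1f}x")        # 145.9x
    print(f"step 2:  {speedup(T_OPTIMIZED, T_COPROCESSOR):.1f}x")   # 2.3x
    print(f"overall: {speedup(T_SERIAL, T_COPROCESSOR):.1f}x")      # 340.6x
```

The slide's 145X and 340X figures are these ratios with the fraction dropped. Most of the gain comes from step 1, which is the point of the two-step message: optimize for the host first, then move to the coprocessor.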
Performance Proof-Point: Government and Academic Research
BQCD
• Application: Hybrid Monte-Carlo program that simulates lattice QCD with dynamical Wilson fermions. It is one of the main production programs of the QCDSF collaboration (DEISA) and is also used outside it for quark simulation.
• Status: Many optimizations already in released version; more optimizations and alternative offload model version in development
• Demonstrated Results:
- No source code changes
- Recompiled, selected run-time parameters to get maximum performance
“The performance improvement for BQCD using the Intel Xeon Phi coprocessor was reached in record time, requiring only recompilation. We are confident that larger speed-ups can be obtained with modest modifications of the code.”
• Application: Monte Carlo algorithms are used to evaluate complex instruments, portfolios, and investments. Performance depends on raw computational power and the performance of exp2()
• Status: Case Study available
• Highlights: Dramatic performance scaling for both single-precision and double-precision calculations
• Demonstrated Results:
- Intel® Xeon Phi™ coprocessor fast exp2() and FMA instructions deliver high performance, high accuracy for single precision computations
- Compiler-based loop unrolling delivers high performance
- Cache blocking further optimizes cache utilization, reduces cache misses, and makes outer loop vectorization possible
• Read the Case Study: software.intel.com/en-us/articles/case-
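To make the role of exp2() concrete, here is a minimal sketch of a Monte Carlo pricing loop in which the natural exponential is evaluated through a base-2 exponential, using the identity exp(x) = 2^(x * log2(e)). The slide does not say which instruments or model were used, so everything specific here is an assumption chosen for illustration: a European call option under geometric Brownian motion, the function names, and the parameters. The closed-form Black-Scholes price is included only as a sanity check.

```python
import math
import random

LOG2_E = math.log2(math.e)  # exp(x) == 2 ** (x * LOG2_E)


def exp_via_exp2(x):
    """Evaluate e**x through a base-2 exponential, as exp2() hardware would."""
    return 2.0 ** (x * LOG2_E)


def mc_call_price(s0, k, r, sigma, t, n_paths, seed=12345):
    """Monte Carlo price of a European call (illustrative model assumption)."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma * sigma) * t
    vol = sigma * math.sqrt(t)
    total = 0.0
    for _ in range(n_paths):  # paths are independent, so this loop parallelizes
        z = rng.gauss(0.0, 1.0)
        s_t = s0 * exp_via_exp2(drift + vol * z)  # one exponential per path
        total += max(s_t - k, 0.0)
    return math.exp(-r * t) * total / n_paths


def black_scholes_call(s0, k, r, sigma, t):
    """Closed-form reference price, used only to check the simulation."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma * sigma) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)

    def cdf(v):
        return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

    return s0 * cdf(d1) - k * math.exp(-r * t) * cdf(d2)


if __name__ == "__main__":
    mc = mc_call_price(100.0, 100.0, 0.05, 0.2, 1.0, 200_000)
    bs = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
    print(f"Monte Carlo: {mc:.3f}  Black-Scholes: {bs:.3f}")
```

Each path costs one exponential plus a handful of multiply-adds, which is why the speed of exp2() and FMA dominates the run time of this kind of workload.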
Performance Proof-Point: Government and Academic Research
• Application: Weather Research and Forecasting (WRF)
• Status: WRF V3.5 was released 4/18/13
• Code Optimization:
– Approximately two dozen files with fewer than 2,000 lines of code were modified (out of approximately 700,000 lines of code in about 800 files, all Fortran standard compliant)
– Most modifications improved performance for both the host and the coprocessors
• Performance Measurements: Pre-release of WRF 3.5 (V3.5Pre) and NCAR-supported CONUS2.5KM benchmark (a high-resolution weather forecast)
• Acknowledgments: There were many contributors to these results, including the National Renewable Energy Laboratory and The Weather Channel Companies
Performance Proof-Point: Government and Academic Research
• Application: Sandia National Laboratories' best approximation to an unstructured implicit finite element or finite volume application in fewer than 8000 lines of code
• Status: available at http://software.sandia.gov/trac/mantevo/browser/trunk/packages
• Demonstrated Results:
- Porting was easy using OpenMP
- Substituting an Intel MKL routine for the sparse matrix-vector product accelerated performance and will simplify future optimization
- The Intel MPI Library enables rapid performance improvement when adding an Intel® Xeon Phi™ coprocessor
• Read the Case Study:
“The programming models available for the Intel MIC Architecture are open-standard and portable between traditional processors and Intel Xeon Phi coprocessors. This should allow us to leverage code development across multiple platforms.”
James A. Ang, Ph.D., Extreme-scale Computing, Sandia National Laboratories
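The sparse matrix-vector product mentioned in the results is the kind of kernel that implicit finite element codes spend much of their time in. The sketch below shows what that operation computes, assuming compressed sparse row (CSR) storage. The storage format, the function name, and the small test matrix are illustrative assumptions; the slides do not show the application's own routine or the MKL call that replaced it.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Return y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row i owns entries row_ptr[i] .. row_ptr[i + 1] - 1
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):  # rows are independent: the natural parallel loop
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect access into x
        y[row] = acc
    return y


if __name__ == "__main__":
    # 3x3 tridiagonal matrix [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
    values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
    col_idx = [0, 1, 0, 1, 2, 1, 2]
    row_ptr = [0, 2, 5, 7]
    print(csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))  # [0.0, 0.0, 4.0]
```

The outer loop over rows is what an OpenMP port would distribute across threads. The indirect access into x makes the inner loop hard to vectorize well by hand, which is one reason to hand the kernel to a tuned library routine.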
• China Oil & Gas Geoeast Pre-stack Time Migration (see note 3): UP TO 3.54X
Notes:
1. 8 node cluster, each node with 2S Xeon* (comparison is cluster performance with and without 1 Xeon Phi* per node) (Hetero)
2. 2S Xeon* vs. 1 Xeon Phi* (preproduction HW/SW & Application running 100% on coprocessor unless otherwise noted)
3. 2S Xeon* vs. 2S Xeon* + 2 Xeon Phi* (offload)
SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2012
Embree Ray Tracing
• Intel Labs Ray Tracing (see note 2): SPEED-UP 2.11X
Notes:
1. 2S Xeon* vs. 1 Xeon Phi* (preproduction HW/SW & Application running 100% on coprocessor unless otherwise noted)
2. Intel Measured Oct. 2012
3. Includes additional FLOPS from transcendental function unit
SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2012
Introduction
High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
Intel Xeon Phi Case Studies
Intel Xeon Phi Ecosystem
Conclusions & References
• System: TACC Stampede is a 10 petaflop supercomputer, one of the largest computing systems in the world for open science research. It became operational on January 7, 2013
• Status: In Service
• Workloads: Runs hundreds of applications for thousands of users around the world
• Performance:
– More than 7 petaflops using Intel® Xeon Phi™ coprocessors
System: Tianhe-2. Located in Southwest China, it contains 16,000 nodes, making it the world's largest (public) installation of Intel Ivy Bridge processors and Xeon Phi coprocessors. Each cluster node contains:
• Two 12-core Intel® Xeon® Ivy Bridge CPUs @ 2.2GHz
• Three Intel® Xeon Phi™ cards, each with 57 cores @ 1.1GHz
Performance: Theoretical peak of 54.9 Pflop/s
• 6.8 Pflop/s from 32,000 Xeon Ivy Bridge sockets
• 48.1 Pflop/s from 48,000 Xeon Phi cards
• for a total of 3,120,000 cores.
30.65 Pflop/s sustained Linpack.
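The peak figures above can be checked with simple arithmetic: units x cores x clock x floating-point operations per cycle. The sketch below does that under two assumptions that are not on the slide: 8 double-precision flops per cycle per Ivy Bridge core and 16 per Xeon Phi core, the usual values for 256-bit AVX and 512-bit FMA respectively. It also assumes 12 cores per Xeon socket, the count consistent with the stated total of 3,120,000 cores. The function and constant names are hypothetical.

```python
GHZ = 1e9
PFLOPS = 1e15

# Node counts and clocks from the slide; 12 cores per socket matches the core total.
XEON_SOCKETS, XEON_CORES, XEON_CLOCK = 32_000, 12, 2.2 * GHZ
PHI_CARDS, PHI_CORES, PHI_CLOCK = 48_000, 57, 1.1 * GHZ

# Assumptions (not on the slide): double-precision flops per core per cycle.
XEON_DP_FLOPS_PER_CYCLE = 8    # 256-bit AVX: 4-wide add + 4-wide multiply
PHI_DP_FLOPS_PER_CYCLE = 16    # 512-bit vectors: 8-wide fused multiply-add

LINPACK_SUSTAINED = 30.65 * PFLOPS  # from the slide


def peak_flops(units, cores_per_unit, clock_hz, flops_per_cycle):
    """Theoretical peak in flop/s for a set of identical processors."""
    return units * cores_per_unit * clock_hz * flops_per_cycle


def system_summary():
    """Recompute the slide's totals from the per-component figures."""
    xeon = peak_flops(XEON_SOCKETS, XEON_CORES, XEON_CLOCK, XEON_DP_FLOPS_PER_CYCLE)
    phi = peak_flops(PHI_CARDS, PHI_CORES, PHI_CLOCK, PHI_DP_FLOPS_PER_CYCLE)
    return {
        "xeon_pflops": xeon / PFLOPS,
        "phi_pflops": phi / PFLOPS,
        "total_pflops": (xeon + phi) / PFLOPS,
        "total_cores": XEON_SOCKETS * XEON_CORES + PHI_CARDS * PHI_CORES,
        "linpack_efficiency": LINPACK_SUSTAINED / (xeon + phi),
    }


if __name__ == "__main__":
    s = system_summary()
    print(f"Xeon peak:     {s['xeon_pflops']:.2f} Pflop/s")    # 6.76
    print(f"Xeon Phi peak: {s['phi_pflops']:.2f} Pflop/s")     # 48.15
    print(f"Total peak:    {s['total_pflops']:.2f} Pflop/s")   # 54.91
    print(f"Total cores:   {s['total_cores']:,}")              # 3,120,000
    print(f"Linpack efficiency: {s['linpack_efficiency']:.1%}")  # 55.8%
```

Under these assumptions the coprocessors supply close to 88% of the theoretical peak, and the sustained Linpack result corresponds to a little over half of that peak.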
More Information: "Visit to the National University for Defense Technology Changsha, China." Jack Dongarra, University of Tennessee, and Oak Ridge National Laboratory. June 2013. www.netlib.org/utk/people/JackDongarra/PAPERS/tianhe-2-dongarra-report.pdf
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.