Hyper-Threading Intel Compilers Andrey Naraikin Senior Software Engineer Software Products Division Intel Nizhny Novgorod Lab November 29, 2002.


Agenda

Hyper-Threading Technology Overview

Introduction: Intel SW Development Tools
– Motivation
– Challenges
– Intel SW Tools

Intel Compilers Overview
– Technologies supported
– SPEC and other benchmarks
– Some features supported by Intel Compilers

Hyper-Threading Overview: Today’s Processors

Single Processor Systems
– Instruction Level Parallelism (ILP)
– Performance improved with more CPU resources

Multiprocessor Systems
– Thread Level Parallelism (TLP)
– Performance improved by adding more CPUs

Hyper-Threading Technology brings TLP to single-processor systems.

Hyper-Threading Overview: Today’s Software

Sequential tasks

Parallel tasks

[Figure: example tasks: Open File, Edit, Spell Check; Open DB’s, Address Book, InBox, Meeting]

Hyper-Threading Overview: Multi-Processing

Multi-tasking workload + processor resources => Improves MT Performance

Run parallel tasks using multiple processors

[Figure: tasks running on CPU 1, CPU 2, CPU 3]

Hyper-Threading: Quick View

[Figure: diagrams of Multiprocessor, Dual-Core Architecture, and Hyper-Threading configurations, showing architecture states (AS) and processor execution resources]

AS = Architecture State (eax, ebx, control registers, etc.), xAPIC

Hyper-Threading Technology looks like two processors to software

Hyper-Threading Architecture Overview

Hyper-Threading Architecture Details

Pentium, VTune and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.

Hyper-Threading Overview: Resource Utilization

[Figure: execution-unit usage over time (proc. cycles) for Superscalar, Multiprocessing, Hyper-Threading, and Multiprocessing With Hyper-Threading]

Note: Each box represents a processor execution unit

Performance Benefit

[Figure: relative speedup (axis 0 to 2) of applications A1 to A9; series: Serial, HTT, SMP]

Code   Description
A1     Engineering
A2     Genetics
A3     Chemistry
A4     Engineering
A5     Weather
A6     Genetics
A7     CFD
A8     FEA
A9     FEA

Source: “Hyper-Threading Technology: Impact on Compute-Intensive Workloads,” Intel Technology Journal, Vol. 6, 2002.

Key Point

Hyper-Threading Technology gives better utilization of processor resources

Hyper-Threading Technology gives more computing power for multithreaded applications


Collateral

Web Sites
– http://developer.intel.com/technology/hyperthread/
– http://developer.intel.com/design/pentium4/applnots
– http://developer.intel.com/design/pentium4/manuals

Documentation and application notes
– IA-32 Intel® Architecture Software Developer’s Manual
– Intel Pentium® 4 and Intel Xeon™ Processor Optimization Manual
– Intel App Note AP-485, “Intel Processor Identification and the CPUID Instruction”
– Intel App Note AP-949, “Using Spin-Loops on Intel Pentium 4 Processor and Intel Xeon Processor”
– Intel App Note “Detecting Support for Jackson Technology Enabled Processors”

Collateral (Cont’d)

Intel Technology Journal
– http://developer.intel.com/technology/itj/

Intel Threading Tools
– http://www.intel.com/software/products/

OpenMP
– http://www.openmp.org

HT Overview
– http://www.ixbt.com/cpu/pentium4-3ghz-ht.shtml

Intel SW Development Tools: Performance Advantage and Optimization Path

[Figure: Sunset Simulation, optimized performance. Starting from a standard compiler (1x), the optimization path passes through analysis with VTune™, the Intel Compiler, OpenMP threading, and the Performance Libraries (IPP or MKL). Steps are labeled "Little or No Code Change" and "Minor Code Change (1 Line)". Speedups shown along the path: 4x, 7x, 9x, 13x, and 15x faster at the end.]

Intel® Compilers

C, C++ and Fortran95
– Available on Windows* and Linux*
– Available for 32-bit and 64-bit platforms

Utilization of latest processor/platform features
– Optimizations for NetBurst™ architecture (Pentium® 4 and Xeon™ processor)
– Optimizations for Itanium® architecture

Seamless integration into Windows* (IDE) and Linux* environment

Source and binary compatible with Microsoft* compiler; mostly source compatible with GNU (gcc)

Intel SW Development Tools – Compilers

Benchmarks: Intel® Compilers 6.0 for Windows*

SPECint_base2000
– Leading C++ Compiler: SPECint_base2000 = 703
– Intel® C++ Compiler 6.0: SPECint_base2000 = 825
– 17% faster integer performance

SPECfp_base2000 (Geomean of Fortran)
– CVF* 6.6: Geomean of Fortran = 686
– Intel® Fortran Compiler 6.0: Geomean of Fortran = 881
– 28% faster floating-point performance

Configuration info: Intel® Pentium® 4 Processor, 2.4 GHz, Intel® Medford 850 Motherboard (D850MD 850 motherboard) Chipset, 256 MB Memory, Windows* XP Professional Edition (build 2600), GeForce 3/nVidia* Graphics

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. Users’ results are dependent upon the application characteristics (loopy vs. flat), mix of C and C++, and other factors. For more information on performance tests and on the performance of Intel products, reference [www.intel.com] or call (U.S.) 1-800-628-8686 or 1-916-356-3104.


Intel® C++ Compiler 6.0 for Linux*

PovRay Image Rendering Time
– gcc 2.96, O2 and fast-math optimization: 20.30 seconds
– Intel® 6.0, comparable optimization: 14.75 seconds
– Intel® 6.0, maximum optimization: 13.57 seconds

[Figure: bar chart of improvement, axis from 60% to 160%]

Configuration info: Intel® Pentium® 4 processor, 2.0 GHz, 256 MB Memory, nVidia* GeForce 2 graphics card, Linux* 2.4.7, PovRay 3.1G

Special Performance Features

Auto-Vectorization for NetBurst™ architecture

Software-Pipelining for EPIC architecture

Auto-Parallelization and OpenMP based parallelization
– for Hyper-Threading and multi-processor systems

Data Pre-Fetching

Profile-Guided Optimization (PGO)

Inter-procedural Optimization (IPO)

CPU Dispatch
– Establishes code path at runtime dependent on actual processor type
– Allows single binary with optimal performance across processor families
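The CPU Dispatch idea can be illustrated by hand with a function pointer that is chosen once at run time. This is only a sketch of the concept, not how the Intel compiler implements it: `select_add`, `add_generic`, `add_tuned`, and the `has_feature` flag are hypothetical names, and the processor identification step itself is left out.

```c
#include <stddef.h>

/* Signature shared by every implementation of the kernel. */
typedef void (*add_fn)(double *a, const double *b, size_t n);

/* Baseline path: plain scalar loop, safe on any processor. */
static void add_generic(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}

/* Stand-in for a processor-specific path. A real build would compile this
 * variant with processor-specific options; here it is only unrolled by two
 * so that it is a distinct code path with identical results. */
static void add_tuned(double *a, const double *b, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
    }
    for (; i < n; i++)   /* leftover element when n is odd */
        a[i] += b[i];
}

/* Hypothetical selector: has_feature stands for the result of a processor
 * identification step (for example CPUID), which this sketch does not do. */
add_fn select_add(int has_feature)
{
    return has_feature ? add_tuned : add_generic;
}
```

A caller would pick the path once, for example `add_fn add = select_add(flag);`, and then call `add(a, b, n)` everywhere, so one binary carries both code paths.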


Techniques Overview

Exploit parallelism to speed up applications

Vectorization
– Supported by programming languages and compilers
– Motivated by modern architectures
    Superscalarity, deeply pipelined core
    SIMD
    Software pipelining on Itanium™ architecture

Parallelization
– OpenMP™ directives for shared memory multiprocessor systems
– MPI computations for clusters

Features by Intel Compilers

Intel processors and vectorization

Type of processor                                          Vectorization features supported
Pentium® with MMX™ technology, Pentium® II processors      Integer types, 64 bits
Pentium® III processor                                     Streaming SIMD Extensions (SSE), single precision floating point
Pentium® 4 processor                                       Streaming SIMD Extensions 2 (SSE2), double precision floating point, integer types, 128 bits

Features by Intel Compilers - Vectorization

Automatic Vectorization

Compiler automatically transforms sequential code for SIMD execution

Sequential source:

    for (i=0; i<n; i++) {
        a[i] = a[i] + b[i];
        a[i] = sin(a[i]);
    }

Vectorized form, compiled with icl -Qx[MKW]:

    for (i=0; i<n; i=i+VL) {
        a(i : i+VL-1) = a(i : i+VL-1) + b(i : i+VL-1);    /* HW SIMD instruction */
        a(i : i+VL-1) = _vmlSin(a(i : i+VL-1));           /* Run-Time Library */
    }
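The vectorized form on the slide steps through the array VL elements at a time. A hand-written C sketch of that strip-mining step for the addition alone might look as follows. The value `VL = 4` and the name `add_strip_mined` are assumptions made for illustration, and the remainder loop is an addition the slide does not show: it covers the case where n is not a multiple of VL.

```c
#include <stddef.h>

#define VL 4  /* assumed vector length (elements per strip), for illustration */

/* Strip-mined form of: for (i = 0; i < n; i++) a[i] = a[i] + b[i]; */
void add_strip_mined(double *a, const double *b, size_t n)
{
    size_t i = 0;

    /* Main loop: whole strips of VL elements. Each strip corresponds to one
     * a(i : i+VL-1) = a(i : i+VL-1) + b(i : i+VL-1) step on the slide. */
    for (; i + VL <= n; i += VL) {
        for (size_t lane = 0; lane < VL; lane++)
            a[i + lane] = a[i + lane] + b[i + lane];
    }

    /* Remainder loop: the last n % VL elements, one at a time. */
    for (; i < n; i++)
        a[i] = a[i] + b[i];
}
```

A vectorizing compiler would replace the inner `lane` loop with a single SIMD instruction; the sketch only makes the loop structure visible.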


Vectorization Example

    double a[N], b[N];
    int i;

    for (i = 0; i < N; i++)
        a[i] = a[i] + b[i];

Compiled with icl -QxW

[Figure: a = 0.0, 1.0, ..., 9.0 and b = 0.0, 1.0, ..., 9.0 are added element by element, giving 0.0, 2.0, 4.0, ..., 18.0; shown for both the Scalar and the Vector version of the loop]
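For comparison with what the compiler generates automatically, the same loop can be written by hand with SSE2 intrinsics, which process two doubles per instruction. This is a sketch, not the compiler's actual output: the function name `add_sse2` is made up here, and the `__SSE2__` check relies on the macro that GCC and Clang define when SSE2 is enabled (other compilers may signal it differently), with a scalar fallback otherwise.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* a[i] = a[i] + b[i] for doubles, two elements per SSE2 instruction. */
void add_sse2(double *a, const double *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);      /* load a[i], a[i+1] */
        __m128d vb = _mm_loadu_pd(&b[i]);      /* load b[i], b[i+1] */
        _mm_storeu_pd(&a[i], _mm_add_pd(va, vb));
    }
#endif
    /* Scalar loop: leftover element, or everything when SSE2 is unavailable. */
    for (; i < n; i++)
        a[i] = a[i] + b[i];
}
```

The unaligned load and store variants are used so that the sketch makes no assumption about how the arrays are aligned.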

Reduction Example

    float a[N], x;
    int i;

    x = 0.0;
    for (i = 0; i < N; i++)
        x += a[i];

[Figure: vectorized sum of a = 0.0, 1.0, ..., 11.0 using four partial sums]

Loop kernel (four partial sums after each step):

     0.0    0.0    0.0    0.0
     0.0    1.0    2.0    3.0
     4.0    6.0    8.0   10.0
    12.0   15.0   18.0   21.0

Postlude (partial sums combined):

    30.0   36.0
    66.0
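The loop kernel and postlude in the figure can be mimicked in plain C with four partial sums. This is a sketch under the assumption of four lanes, as in the figure; the name `sum_four_lanes` is made up, and the final scalar loop for leftover elements is an addition not shown on the slide.

```c
#include <stddef.h>

/* Sum of a[0..n-1] using four partial sums, as a 4-wide SIMD unit would. */
float sum_four_lanes(const float *a, size_t n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;

    /* Loop kernel: element i+k is added to partial sum k. */
    for (; i + 4 <= n; i += 4) {
        lane[0] += a[i];
        lane[1] += a[i + 1];
        lane[2] += a[i + 2];
        lane[3] += a[i + 3];
    }

    /* Postlude: combine the partial sums pairwise (0 with 2, 1 with 3). */
    float even = lane[0] + lane[2];
    float odd  = lane[1] + lane[3];
    float x = even + odd;

    /* Elements left over when n is not a multiple of 4. */
    for (; i < n; i++)
        x += a[i];

    return x;
}
```

For a = 0.0, 1.0, ..., 11.0 the partial sums end at 12, 15, 18, 21, the postlude gives 30 and 36, and the result is 66.0, matching the figure. Because floating-point addition is not associative, this reordering can in general round differently from the sequential loop.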


Parallel Program Development

Parallelization approaches, in order of increasing ease of use/maintenance:
– Explicit threading using operating system calls
– With industry standard OpenMP* directives
– Automatically using the compiler

Features by Intel Compilers - Parallelization

Autoparallelization

    float a[N], b[N], c[N];
    int i;
    for (i=0; i<N; i++)
        c[i] = a[i]*b[i];

icl -Qparallel foo.c   (-parallel on Linux)

    ….foo.c
    foo.c(7) : (col. 2) remark: LOOP WAS AUTO-PARALLELIZED.
    ....

./foo.exe -- Executable detects and uses number of processors…

-Qpar_report[n] - get helpful messages from the compiler
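Whether a loop can be auto-parallelized depends on whether its iterations are independent. The two functions below are an illustrative contrast, with made-up names: the first has the same independent form as the slide's loop, the second carries a value from one iteration to the next.

```c
#include <stddef.h>

/* Independent iterations: c[i] depends only on a[i] and b[i], so iterations
 * can run in any order or on different threads. */
void multiply(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

/* Loop-carried dependence: iteration i reads c[i-1], which iteration i-1
 * writes, so the iterations cannot simply be split across threads. */
void running_sum(const float *a, float *c, size_t n)
{
    if (n == 0)
        return;
    c[0] = a[0];
    for (size_t i = 1; i < n; i++)
        c[i] = c[i - 1] + a[i];
}
```

On the slide the arrays are declared directly, so the compiler can see that they do not overlap. With pointer parameters, as in this sketch, a compiler may additionally need to establish that `c` does not alias `a` or `b` before it parallelizes the first loop.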


OpenMP™ Directives

OpenMP* standard (www.openmp.org)
– Set of directives to enable the writing of multithreaded programs

Use of shared memory parallelism on programming language level
– Portability
– Performance

Support by Intel® Compilers
– Windows*, Linux*
– IA-32 and Itanium™ architectures


Simple Directives

    void foo(float *a, float *b, float *c)
    {
        int i;
    #pragma parallel
        for (i=0; i<N; i++) {
            *c++ = (*a++)*bar(b++);
        }
    }

Pointers and procedure calls with escaped pointers prevent analysis for autoparallelization.

Use simple directives instead.


OpenMP* Directives

    void foo()
    {
        int a[1000], b[1000], c[1000], x[1000], i, NUM;

        /* parallel region */
    #pragma omp parallel private(NUM) shared(x, a, b, c)
        {
            NUM = omp_get_num_threads();

    #pragma omp for private(i)   /* work-sharing for loop */
            for (i = 0; i < 1000; i++) {
                x[i] = bar(a[i], b[i], c[i], NUM);   /* assume bar has no side-effects */
            }
        }
    }

icl -Qopenmp -c foo.c   (-openmp on Linux)

    foo.c
    foo.c(10) : (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
    foo.c(7) : (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
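A compilable variant of the slide's `foo` is sketched below. Several things are assumptions made so that it stands alone: the arrays are passed in as parameters, `bar` is a hypothetical placeholder whose result does not depend on the thread count, and the `_OPENMP` checks let the same file build as an ordinary serial program when OpenMP is not enabled.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define LEN 1000

/* Hypothetical stand-in for the slide's bar(): no side effects, and the
 * result does not depend on the thread count passed in num. */
static int bar(int a, int b, int c, int num)
{
    (void)num;
    return a + b * c;
}

/* Same structure as the slide's foo(), with the arrays passed in. */
void foo(int *x, const int *a, const int *b, const int *c)
{
    int i, NUM;

#pragma omp parallel private(NUM) shared(x, a, b, c)
    {
#ifdef _OPENMP
        NUM = omp_get_num_threads();
#else
        NUM = 1;                      /* serial build: one thread */
#endif

#pragma omp for private(i)           /* work-sharing for loop */
        for (i = 0; i < LEN; i++)
            x[i] = bar(a[i], b[i], c[i], NUM);
    }
}
```

Without an OpenMP option the pragmas are ignored and the loop runs serially; with OpenMP enabled the iterations are divided among the threads of the parallel region, and the contents of `x` are the same either way.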


OpenMP™ + Vectorization

Combined speedup

Order of use might be important
– Parallelization overhead
– Vectorize inner loops
– Parallelize outer loops

Supported by Intel® Compilers
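The "parallelize outer loops, vectorize inner loops" guideline can be sketched on a simple two-dimensional update. The function `scale_add` and its row-major layout are assumptions for illustration; whether the inner loop is actually vectorized depends on the compiler and its options.

```c
#include <stddef.h>

/* y[i][j] = alpha * x[i][j] + y[i][j] on row-major rows x cols matrices. */
void scale_add(float *y, const float *x, float alpha, int rows, int cols)
{
    int i;

    /* Outer loop: rows are independent, so they are shared among threads. */
#pragma omp parallel for
    for (i = 0; i < rows; i++) {
        float *yrow = y + (size_t)i * cols;
        const float *xrow = x + (size_t)i * cols;

        /* Inner loop: unit-stride, no dependence between iterations, the
         * kind of loop a vectorizer can map onto SIMD instructions. */
        for (int j = 0; j < cols; j++)
            yrow[j] = alpha * xrow[j] + yrow[j];
    }
}
```

Placing the threading on the outer loop means the parallelization overhead is paid once per call rather than once per row, while each thread still runs a long contiguous inner loop.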

Features by Intel Compilers

Make performance a feature of your applications today: stay competitive

Intel® Compilers

– Leading-edge compiler technologies
– Compatible with leading industry standard compilers
– Processor optimized code generation
– Support single source code across Intel processor families



To be continued…
