AMD Opteron Architecture and Software Infrastructure
Tim Wilkens Ph.D.
Member of Technical Staff
[email protected]
March 25, 2022, Computation Products Group
Not all x86 processors are created equal
RISC cores: scrupulous instruction preference
Scalable memory bandwidth and I/O
physical memory scales with CPU count
memory bandwidth scales with CPU count
increased single-threaded memory bandwidth
memory latency does not scale with CPU count
dramatically lower memory latency
Instruction Control Unit (72 entries)
Itanium requires the compiler to think for it: strong compiler reliance
Both Opteron and Itanium are RISC, but Opteron doesn't require reinventing compilers, large caches and a mint to purchase
All x86 RISC cores aren't created equal
Opteron vs Xeon EMT
[Diagram: execution units and pipeline depth]
Opteron: INT and FP execution units. Three ALU/AGU pairs plus MULT on the integer side; FADD, FMUL and FST on the FP side. 80-bit x 3 = 240-bit bandwidth. 12 pipeline stages.
Xeon EMT: FP execution units. FADD and FMUL. 80-bit x 2 = 160 bits, but 128-bit x 1 = 128-bit bandwidth; the constriction limits performance. 31 pipeline stages.
Number of integer pipes and pipeline depth impact integer throughput
Opteron has 3 integer pipes: +50% reg,reg move throughput
Opteron has 3 ALU/AGU units: +50% +, -, logical, shift throughput
size dictates the number of RISC ops an x86 instruction decodes into
instruction selection preference is different for Opteron and Xeon64
Design of FPU and issue bandwidth from FP scheduler
Opteron has 240 bits per clock SIMD throughput, Xeon has 128 bits
Coupled with register file size, Opteron is a more robust engine
Though Xeon64 and Opteron are instruction compatible, Opteron doesn't require extensive compiler tuning to perform well
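The bandwidth figures quoted on this slide are simple products of path width and path count. The sketch below only re-derives them; the names are illustrative. The count of two integer pipes for the comparison part is an assumption, chosen because it is what the "+50%" figure would imply, and is not stated on the slide.

```python
# Peak issue bandwidth per clock, re-derived from the figures on the slide.
OPTERON_FP_PIPE_BITS = 80    # width of each Opteron FP pipe (from the slide)
OPTERON_FP_PIPES = 3         # FADD, FMUL, FST (from the slide)
XEON_ISSUE_BITS = 128        # single 128-bit issue path (from the slide)
XEON_ISSUE_PATHS = 1

OPTERON_INT_PIPES = 3        # from the slide
COMPARISON_INT_PIPES = 2     # ASSUMPTION: implied by the "+50%" claim


def issue_bandwidth_bits(width_bits: int, paths: int) -> int:
    """Bits that can be issued per clock over `paths` paths of `width_bits`."""
    return width_bits * paths


opteron_bits = issue_bandwidth_bits(OPTERON_FP_PIPE_BITS, OPTERON_FP_PIPES)
xeon_bits = issue_bandwidth_bits(XEON_ISSUE_BITS, XEON_ISSUE_PATHS)
fp_ratio = opteron_bits / xeon_bits

# Relative gain from three integer pipes over the assumed two.
int_gain_percent = (OPTERON_INT_PIPES / COMPARISON_INT_PIPES - 1) * 100

print(f"Opteron FP issue bandwidth: {opteron_bits} bits/clock")
print(f"Xeon FP issue bandwidth:    {xeon_bits} bits/clock")
print(f"Opteron/Xeon ratio:         {fp_ratio:.3f}")
print(f"Integer pipe gain:          +{int_gain_percent:.0f}%")
```

These are peak issue figures only; sustained throughput also depends on decode, scheduling and memory behavior, which this arithmetic does not model.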
AMD Opteron™, Pentium® 4 (FPU analysis): Throughput of SSE, SSE2, x87 Operations
AMD Opteron™ Processor Server vs. Intel Xeon MP Processor Server
Key: Memory Traffic, I/O Traffic, IPC Traffic
HyperTransport™ Technology Buses for Glueless I/O or CPU Expansion
HyperTransport™ Technology Buses Enable Glueless Expansion for up to 8-way Servers
Separate Memory and I/O Paths Eliminate Most Bus Contention
HyperTransport Link Has Ample Bandwidth for I/O Devices
Daimler, Morton Thiokol, AMD, Martin Marietta, Ducati, Renault
Crash Analysis: NASA, Boeing, Volvo, Mitsubishi, Ferrari, Volkswagen, Airbus, GM, Ford, Daimler, Honda
Digital Signal Processing: NSA, DEA, CIA, Texas Instruments, AT&T, Sprint, MCI
Materials Science, Biology: Eli Lilly, Bristol-Myers, Dow Chemical, DuPont, Union Carbide, Pfizer, Genentech, Genencor, Accelrys, Incyte Genomics
Oil and Gas: Shell, BP, Total, Petrobras, Halliburton, ChevronTexaco, ExxonMobil, Aramco
Education, Defense: PNNL, LANL, LLNL, ORNL, NCSA, ANL, SNL, BNL, FNAL, NERSC
Financial Analysis: NumeriX, Palisade, MathWorks, Wolfram Research, Goldman Sachs, Morgan Stanley, JP Morgan, Salomon Brothers
Connecting HPC and you: How HPC impacts our daily lives
Gaming, Real World Realism: water surfaces, physics gaming engines
Rendered Movies: modeling real clothing surfaces (PDEs)
Medical Procedures: CAT scan imaging, Cancer Radiation Therapy
Airline Flight Schedules: minimizing equations of constraint (fuel, food, time, etc.)
National Security: voice analysis and authentication, weapons simulation
AMD Core Math Library (ACML): Assembly Optimizations and Accuracy
"Give me a long lever and a place upon which to stand and I will move the world." Archimedes, circa 250 B.C.
[Chart: DGEMM: Small Square Matrix Timings of ACML 2.0, 2.1 and 2.5 on 2 GHz Opteron. X axis: Square Matrix Size (Order), 4 to 94. Y axis: MFLOPS, 400 to 3600. Series: ACML 2.0, ACML 2.1, ACML 2.5.]
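A MFLOPS figure for a square matrix multiply is conventionally derived from the operation count 2n^3 (n^3 multiplies plus n^3 adds) divided by the elapsed time. The sketch below shows that conversion with a naive pure-Python multiply standing in for DGEMM. It does not call ACML, the function names are illustrative, and its absolute numbers say nothing about the chart; only the formula carries over.

```python
import time


def dgemm_flops(n: int) -> int:
    """Operation count for an n x n by n x n multiply: n^3 multiplies + n^3 adds."""
    return 2 * n ** 3


def mflops(n: int, seconds: float) -> float:
    """Convert an elapsed time for one n x n multiply into MFLOPS."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return dgemm_flops(n) / seconds / 1e6


def naive_matmul(a, b):
    """Stand-in for DGEMM: plain row-by-column product of two square matrices."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def seconds_per_multiply(n: int, reps: int = 20) -> float:
    """Average time of one multiply, timing `reps` multiplies in one region."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    start = time.perf_counter()
    for _ in range(reps):
        naive_matmul(a, b)
    return (time.perf_counter() - start) / reps


# A few of the small orders from the chart's X axis.
for order in (4, 14, 24):
    elapsed = seconds_per_multiply(order)
    print(f"order {order:3d}: {mflops(order, elapsed):10.2f} MFLOPS (naive Python)")
```

Timing several multiplies in one region matters at these sizes, because a single 4 x 4 product is too short to measure reliably on its own.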
Compiler Ecosystem
PGI, Pathscale, GNU, Absoft, Intel, Microsoft and SUN
Compiler Comparisons Table: Critical Features Supported by x86 Compilers
Features compared (columns): Vector SIMD Support; Peels Vector Loops; Global IPA; OpenMP; Links ACML Libraries; Profile Guided Feedback; Aligns Vector Loops; Parallel Debuggers; Large Array Support; Medium Memory Model
Compilers compared (rows): PGI, GNU, Intel, Pathscale, Absoft, SUN, Microsoft
[Table cells not preserved]
Tuning Performance with Compilers: Maintaining Stability while Optimizing
STEP 0: Build the application using the following procedure:
compile all files with the most aggressive optimization flags below:
-tp k8-64 -fastsse
if compilation fails or the application doesn't run properly, turn off vectorization:
if problems persist, compile at optimization level 0:
-tp k8-64 -O0
STEP 1: Profile the binary and determine the performance-critical routines
STEP 2: Repeat STEP 0 on the performance-critical functions, one at a time, and run the binary after each step to check stability
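STEP 0 amounts to walking down a ladder of flag sets until one both builds and runs correctly. The sketch below shows that selection logic only. The middle rung is an assumption: the slide does not show the flags for turning off vectorization, so the scalar-SSE flag set recommended elsewhere in this deck is used as a stand-in. The check function is a hypothetical placeholder for invoking the compiler and running the application's own validation.

```python
from typing import Callable, Optional, Sequence

# Flag sets in decreasing order of aggressiveness.
FLAG_LADDER = (
    "-tp k8-64 -fastsse",           # most aggressive (from the slide)
    "-tp k8-64 -fast -Mscalarsse",  # ASSUMPTION: stand-in for "vectorization off"
    "-tp k8-64 -O0",                # last resort (from the slide)
)


def first_stable_flags(
    ladder: Sequence[str],
    builds_and_runs: Callable[[str], bool],
) -> Optional[str]:
    """Return the most aggressive flag set that passes the check, or None."""
    for flags in ladder:
        if builds_and_runs(flags):
            return flags
    return None


def fake_check(flags: str) -> bool:
    """Hypothetical check: pretend the vectorized build gives wrong results."""
    return "-fastsse" not in flags


chosen = first_stable_flags(FLAG_LADDER, fake_check)
print(f"use: {chosen}")
```

For STEP 2 the same function could be applied per source file, with a check that rebuilds only that file at the candidate flags and then reruns the whole binary.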
Tuning Memory IO Bandwidth: Optimizing large streaming operations
2 methods of writing to memory in x86/x86-64:
traditional memory stores cause write allocates to cache
1. the cache line to be modified is read into cache
2. the line is modified in cache and written back to memory when it is evicted to make room for new data
3. to write N bytes, 2N bytes of bandwidth are generated
non-temporal stores bypass cache and write directly to memory
1. no write allocate to cache; to write N bytes, N bytes of bandwidth are generated
2. data is not kept in cache, so do not use them with frequently reused data
Use them only in functions that write at least half the L2 cache size in bytes (L2/2 or more), which normally assures the data has little cache reuse value
Group all eligible routines into a common file so as to simplify the compilation procedure. Enable non-temporal stores in the PGI compiler with the -Mnontemporal compiler option
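The traffic accounting and the size rule above can be written down as a small model. This is a back-of-the-envelope sketch, not a measurement: it counts only the bytes the slide describes (2N for write-allocating stores, N for non-temporal stores) and applies the L2/2 threshold. The 1 MB L2 size is an assumption to be replaced with the value for the actual part.

```python
L2_BYTES = 1024 * 1024  # ASSUMPTION: 1 MB L2 cache; adjust for the target CPU


def store_traffic_bytes(n_bytes: int, non_temporal: bool) -> int:
    """Memory traffic generated by writing n_bytes, per the slide's accounting."""
    if non_temporal:
        return n_bytes       # written straight to memory, no write allocate
    return 2 * n_bytes       # line read into cache, later written back


def should_use_non_temporal(n_bytes: int, l2_bytes: int = L2_BYTES) -> bool:
    """Slide's rule of thumb: only for writes of at least half the L2 size."""
    return n_bytes >= l2_bytes // 2


for size in (64 * 1024, 512 * 1024, 8 * 1024 * 1024):
    use_nt = should_use_non_temporal(size)
    regular = store_traffic_bytes(size, non_temporal=False)
    chosen = store_traffic_bytes(size, non_temporal=use_nt)
    print(f"{size:>8} bytes written: non-temporal={use_nt}, "
          f"traffic {chosen} vs {regular} with regular stores")
```

The model ignores the cost of re-reading data that a non-temporal store kept out of cache, which is why the threshold exists in the first place.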
Below are 3 different sets of recommended PGI compiler flags for flag mining application source bases:
enables instruction-level tuning for Opteron, O2-level optimizations, SSE scalar and vector code generation, inter-procedural analysis, LRE optimizations and unrolling
strongly recommended for any single precision source code
Middle ground: -tp k8-64 -fast -Mscalarsse
enables everything in the most aggressive set except vector code generation, which can reorder loops and generate slightly different results
in double precision source bases a good substitute, since Opteron has the same throughput on both scalar and vector code
Below are 3 different sets of recommended PGI compiler flags for flag mining application source bases:
Most aggressive: -O3
loop transformations, instruction preference tuning, cache tiling, and SIMD code generation (CG). Generally provides the best performance but may cause compilation failure or slow performance in some cases
strongly recommended for any single precision source code
Middle ground: -O2
enables most of the options enabled by -O3, including SIMD CG, instruction preferences, common sub-expression elimination, pipelining and unrolling
in double precision source bases a good substitute, since Opteron has the same throughput on both scalar and vector code
Turn off Buffer Overrun Checking
The compiler enables /GS by default to check for buffer overruns. Turning off checking by specifying /GS- may yield additional performance
Microsoft Compiler Flags: Functionality Flags
/GR
enables run-time type information
/GT
supports fiber safety for data allocated using static thread-local storage
/Wp64
detects most 64-bit portability problems
/LD
creates a dynamic-link library
/Oa
assumes no aliasing
/Ow
assumes aliasing across function calls but not inside functions
64-Bit Operating Systems: Recommendations and Status
SUSE SLES 9 with latest Service Pack available
Has technology for supporting latest AMD processor features
Widest breadth of NUMA support and enabled by default
Oprofile system profiler installable as an RPM and modularized
complete support for static and dynamically linked 32-bit binaries
Red Hat Enterprise Server 3.0 Service Pack 2 or later
NUMA features support not as complete as that of SUSE SLES 9
Oprofile installable as an RPM but installation is not modularized and may require a kernel rebuild if RPM version isn't satisfactory
only SP 2 or later has complete 32-bit shared object library support (a requirement to run all 32-bit binaries in 64-bit)
POSIX threading library changed between 2.1 and 3.0, may require users to rebuild applications
64-bit LS-DYNA v5434: 3-Car Model Performance
[Chart: LS-DYNA 3-Car Benchmark Performance. X axis: 8P, 16P, 32P, 64P. Y axis: Wall Clock Time of Execution (lower is better), 0 to 60000. Series: IBM x335 2.8 GHz - Gigabit; HP RX2600 Itanium 2 1.5 GHz - Infiniband; 2P Opteron 2 GHz - Infiniband.]
64-bit LS-DYNA v5434: 3-Car Model Performance
[Chart: LS-DYNA 3-car Benchmark Performance Relative to Itanium 2. X axis: 8P, 16P, 32P, 64P. Y axis: Performance Relative to Itanium 2, 0.40 to 1.60. Series: IBM x335 2.8 GHz - Gigabit; HP RX2600 Itanium 2 1.5 GHz - Infiniband; 2P Opteron 2 GHz - Infiniband.]
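A relative-performance chart of this kind is normally derived from wall clock times by dividing the baseline's time by each system's time, so that values above 1.0 mean faster than the baseline. The sketch below assumes that convention; the slide does not state it. The timing values are hypothetical placeholders, not numbers read from either chart, and the system labels are illustrative.

```python
def relative_performance(baseline_seconds: float, system_seconds: float) -> float:
    """Baseline time over system time; above 1.0 means faster than baseline."""
    if baseline_seconds <= 0 or system_seconds <= 0:
        raise ValueError("wall clock times must be positive")
    return baseline_seconds / system_seconds


# HYPOTHETICAL wall clock times in seconds, keyed by processor count.
wall_clock = {
    "baseline system": {8: 40000.0, 16: 22000.0},
    "other system": {8: 32000.0, 16: 20000.0},
}

baseline = wall_clock["baseline system"]
for name, times in wall_clock.items():
    for procs, seconds in sorted(times.items()):
        ratio = relative_performance(baseline[procs], seconds)
        print(f"{name:16s} {procs:3d}P: {ratio:.2f}x baseline")
```

By construction the baseline system plots as a flat line at 1.00, which gives a quick sanity check when reading such a chart.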
AMD, the AMD Arrow Logo, AMD Opteron and combinations thereof are trademarks of Advanced Micro Devices, Inc. HyperTransport is a licensed trademark of the HyperTransport Technology Consortium. Other product names used in this presentation are for identification purposes only and may be trademarks of their respective companies.