Advanced Modular Software Performance Monitoring
CPU profiling with Intel® VTune™ Amplifier XE
Alexander Mazurov, Ferrara University, CERN
May 21, 2015
I. Event Processing Software
II. Profilers
III. Intel® VTune™ Amplifier XE
IV. Gaudi Framework
V. Gaudi Intel Profiler Auditor
VI. Profiling examples
I. Event Processing Software
Physics events
The Higgs Boson
Simulation * Trigger * Analysis
LHCb High Level Trigger (HLT) Software: Moore
Detector events: 10^6 events/sec
Events to storage: 4500 events/sec
II. Profilers
Collect information about how an application or system performs.
CPU Profiler
Measures the frequency and duration of function calls and/or code instructions.
Profiling Techniques
- Hardware counters
- Instrumenting the code
Hardware counters
Exploit hardware performance counters from the Performance Monitoring Unit (PMU)
Counters:
- Translation lookaside buffer (TLB) misses
- Cache misses
- Stall cycles
- Memory access latency
- ...
Perfmon2 * Intel VTune Amplifier
Instrumenting the code
- Statically:
  * Change code manually / automatically
  * Compiler assisted (gcc -pg)
- Dynamically (at runtime):
  * Change code at runtime
Valgrind - Google Performance Tools - Intel VTune Amplifier
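Manual static instrumentation can be illustrated with a small Python decorator that records how often a function is called and how long it runs in total. This is only a minimal sketch: the names `instrument`, `CALL_STATS` and `track_fit` are hypothetical, and it does not reproduce how gcc -pg, Valgrind or VTune instrument code.

```python
import functools
import time
from collections import defaultdict

# Hypothetical registry: function name -> [number of calls, total seconds]
CALL_STATS = defaultdict(lambda: [0, 0.0])


def instrument(func):
    """Wrap func so that every call updates CALL_STATS."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the call even if func raised an exception.
            entry = CALL_STATS[func.__name__]
            entry[0] += 1
            entry[1] += time.perf_counter() - start
    return wrapper


@instrument
def track_fit(n):
    """Stand-in for a piece of event-processing work (hypothetical)."""
    return sum(i * i for i in range(n))


for _ in range(3):
    track_fit(10_000)

calls, seconds = CALL_STATS["track_fit"]
print(f"track_fit: {calls} calls, {seconds:.6f} s total")
```

The cost of this approach is visible in the sketch itself: the code has to be changed (here, decorated) and every call pays for two timer reads, which is why sampling is often preferred for long jobs.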
III. VTune™ Amplifier XE
Performance Profiling Tool
- x86 (32 and 64-bit)
- GUI and CLI
VTune™ Features
Runtime instrumenting profiler
- User-mode sampling
- Hardware-based sampling
- Concurrency and locks and waits analysis
- Threading timeline
- Attach to a running process
- Source view
How does user-mode sampling work?
1) Interrupts a process
2) Collects samples of all active instruction addresses
3) Restores the call sequence for each sample
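The three steps can be mimicked in pure Python. In this sketch a background thread wakes up every 10 ms, reads the current frame of the profiled thread and walks it back to the outermost caller. It is an assumption-laden illustration, not VTune's mechanism: VTune interrupts the process itself, whereas here a helper thread only observes it, and `sample_stacks`, `busy_work` and `inner_loop` are hypothetical names.

```python
import collections
import sys
import threading
import time
import traceback


def sample_stacks(target_thread_id, interval_s, stop_event, samples):
    """Periodically record the call stack of the target thread."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(target_thread_id)
        if frame is not None:
            # Restore the call sequence, outermost caller first.
            stack = tuple(f.name for f in traceback.extract_stack(frame))
            samples[stack] += 1
        time.sleep(interval_s)


def inner_loop(deadline):
    """Burn CPU until the deadline (stand-in for a hot function)."""
    while time.perf_counter() < deadline:
        pass


def busy_work(duration_s):
    inner_loop(time.perf_counter() + duration_s)


samples = collections.Counter()
stop_event = threading.Event()
sampler = threading.Thread(
    target=sample_stacks,
    args=(threading.get_ident(), 0.01, stop_event, samples),
)
sampler.start()
busy_work(0.3)
stop_event.set()
sampler.join()

# Attribute each sample to the innermost function of its stack.
hotspots = collections.Counter()
for stack, count in samples.items():
    hotspots[stack[-1]] += count
print(hotspots.most_common(3))
```

Because the result is built from periodic snapshots, a function that runs only briefly between two samples may not appear at all.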
User-mode analysis types
- Hotspots
- Concurrency
- Locks and Waits
User-mode sampling
Hotspots analysis:
Group results
Call Stack
Filter by timeline
CPU time by code line
Debug mode (-g)
Sampling Accuracy
User-mode sampling is a statistical method and does not provide 100% accurate results.
Accuracy depends on:
- Duration of the collection
- Speed of processor
- Amount of software activity
- Sampling interval
  * recommended value is 10 ms
  * profiling is only 5% slower
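A rough feel for how collection duration and sampling interval affect accuracy can be had from basic counting statistics. The sketch below assumes that samples are independent and that a function owning a fraction p of the CPU time is hit binomially, so the relative error of its estimated share is sqrt((1 - p) / (p * N)) for N samples. This is a simplification introduced here, not a formula from VTune, and `sampling_estimate` is a hypothetical helper.

```python
import math


def sampling_estimate(duration_s, interval_s, cpu_fraction):
    """Return (samples, expected hits, relative error) for one function.

    Assumes independent samples, i.e. a binomial model (a simplification).
    """
    if not 0.0 < cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be in (0, 1]")
    n_samples = round(duration_s / interval_s)
    if n_samples <= 0:
        raise ValueError("collection is shorter than one sampling interval")
    expected_hits = n_samples * cpu_fraction
    rel_error = math.sqrt((1.0 - cpu_fraction) / (cpu_fraction * n_samples))
    return n_samples, expected_hits, rel_error


# A function using 5% of the CPU time, sampled every 10 ms.
for duration_s in (10, 60, 600):
    n, hits, err = sampling_estimate(duration_s, 0.010, 0.05)
    print(f"{duration_s:>4} s: {n} samples, ~{hits:.0f} in the function, "
          f"relative error ~{err:.1%}")
```

Under this model the error shrinks with the square root of the number of samples, so a tenfold longer collection improves the estimate by roughly a factor of three.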
Integrating VTune™ Amplifier into an Event Processing Framework
IV. Gaudi
Event processing framework
- Moore: Trigger
- Gauss: Simulation
- Brunel: Reconstruction
- Online: Monitoring and commissioning
- DaVinci: Physics analysis
Gaudi Architecture
Algorithms * Services * Tools
Moore Event Loop
Algorithms Sequence: Hlt1DiMuonHighMassFilterSequence, Hlt1DiMuonHighMassStreamer, FastVeloHlt, MuonRec, Velo2CandidatesDiMuonHighMass, GECLooseUnit, createITLiteClusters, createVeloLiteClusters
How to profile algorithms?
V. Gaudi Intel Profiling Auditor
VTune™ User API + Gaudi Auditors API
VTune™ User API
- Start/Pause profiling
- Mark profiling regions
Gaudi Auditors API
Algorithm: start event, end event
Callback functions
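The callback idea can be sketched in a few lines of Python: an event loop notifies every registered auditor before and after each algorithm runs, so the auditor can time or mark that region. All class and method names below (`StopwatchAuditor`, `Algorithm`, `before_execute`, `after_execute`, `run_event_loop`) are hypothetical stand-ins and not the Gaudi Auditors API; the algorithm names are borrowed from the Moore sequence purely as labels.

```python
import time
from collections import defaultdict


class StopwatchAuditor:
    """Hypothetical auditor: accumulates wall-clock time per algorithm."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.calls = defaultdict(int)
        self._started = {}

    def before_execute(self, name):
        # Callback fired when an algorithm starts processing an event.
        self._started[name] = time.perf_counter()

    def after_execute(self, name):
        # Callback fired when the algorithm has finished the event.
        self.totals[name] += time.perf_counter() - self._started.pop(name)
        self.calls[name] += 1


class Algorithm:
    """Hypothetical algorithm: a name plus a callable doing the work."""

    def __init__(self, name, work):
        self.name = name
        self.work = work

    def execute(self, event):
        return self.work(event)


def run_event_loop(algorithms, auditors, events):
    """Run every algorithm on every event, notifying auditors around it."""
    for event in events:
        for alg in algorithms:
            for auditor in auditors:
                auditor.before_execute(alg.name)
            try:
                alg.execute(event)
            finally:
                for auditor in auditors:
                    auditor.after_execute(alg.name)


auditor = StopwatchAuditor()
sequence = [
    Algorithm("FastVeloHlt", lambda event: sum(range(event * 100))),
    Algorithm("MuonRec", lambda event: sorted(range(event * 10), reverse=True)),
]
run_event_loop(sequence, [auditor], events=range(1, 51))

for name, total in auditor.totals.items():
    print(f"{name}: {auditor.calls[name]} calls, {total:.6f} s")
```

A profiler-backed auditor would follow the same shape, with the two callbacks marking a profiling region instead of reading a timer.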
Algorithms profiling (I)
CPU time per sequence branch
Algorithms profiling (II)
Gaudi configuration
from Configurables import AuditorSvc, IntelProfilerAuditor
profiler = IntelProfilerAuditor()
# Profile only events 5000 to 15000
profiler.StartFromEventN = 5000
profiler.StopAtEventN = 15000
AuditorSvc().Auditors += [profiler]
Run: $> intelprofiler -o /collected/data job.py
Analyze (GUI): $> amplxe-gui /collected/data/r001hs
Analyze (CLI): $> amplxe-cl -report hotspots -r /collected/data/r001hs
VI. Profiling examples
1. Memory allocation functions
2. Measuring profiling accuracy
3. Custom reports
1. Memory allocation functions
operator new from the libstdc++ library:
tc_new from the tcmalloc library:
tc_new takes half the time of operator new
2. Measuring profiling accuracy
Intel Profiling Auditor vs. Timing Auditor
The Timing Auditor measures the absolute time of each algorithm's run
1000 events
3. Custom reports
Build reports using CSV files exported from VTune Amplifier
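A custom report of this kind might be built as in the sketch below, which sums CPU time per module from an exported CSV file and prints each module's share. The column names (`Function`, `Module`, `CPU Time`) and all rows are invented for illustration; the real layout depends on the report and export options chosen in VTune Amplifier.

```python
import csv
import io
from collections import defaultdict

# Invented sample input; not real profiling data.
EXAMPLE_CSV = """Function,Module,CPU Time
func_a,libA.so,1.5
func_b,libA.so,0.5
func_c,libB.so,3.0
"""


def cpu_time_by_module(csv_text):
    """Sum the 'CPU Time' column per 'Module' (column names are assumed)."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["Module"]] += float(row["CPU Time"])
    return dict(totals)


totals = cpu_time_by_module(EXAMPLE_CSV)
grand_total = sum(totals.values())

# Print modules from the most to the least expensive.
for module, seconds in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{module:<10} {seconds:6.2f} s  {seconds / grand_total:6.1%}")
```

For a real export, `io.StringIO(csv_text)` would be replaced by an open file handle and the column names adjusted to match the header row.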
Conclusions
Intel® VTune™ Amplifier XE:
+ Various analysis types and reports
+ Rich User API
+ Reasonable overhead time