Memoro
Scaling an LLVM-Based Heap Profiler
Thierry Treyer, Performance & Capacity Intern
Mark Santaniello, Performance & Capacity Engineer
James Larus, EPFL IC School Dean
vector<BigT> getValues(map<Id, BigT>& largeMap, vector<Id>& keys) {
  vector<BigT> values;
  values.reserve(largeMap.size());
  for (const auto& key : keys)
    values.emplace_back(largeMap[key]);
  return values;
}
40 GiB of DRAM wasted per server
vector<BigT> getValues(map<Id, BigT>& largeMap, vector<Id>& keys) {
  vector<BigT> values;
  // Reserve for the values actually returned: keys.size(), not largeMap.size().
  values.reserve(keys.size());
  for (const auto& key : keys)
    values.emplace_back(largeMap[key]);
  return values;
}
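The difference between the two reservations can be checked directly. The sketch below is only an illustration: `Id`, `BigT`, `makeMap`, and the two `capacityWhenReserving...` helpers are hypothetical stand-ins, not code from the talk. It compares the capacity a vector ends up with when it is reserved by the size of the map versus the size of the key list.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-ins for the slide's Id and BigT types.
using Id = int;
struct BigT { char payload[1024]; };

// Reserving by the map's size makes room for every entry in the map,
// even though only a few values are returned.
std::size_t capacityWhenReservingByMap(const std::map<Id, BigT>& largeMap) {
  std::vector<BigT> values;
  values.reserve(largeMap.size());
  return values.capacity();
}

// Reserving by the key count makes room only for the returned values.
std::size_t capacityWhenReservingByKeys(const std::vector<Id>& keys) {
  std::vector<BigT> values;
  values.reserve(keys.size());
  return values.capacity();
}

// Builds a map with n entries, keyed 0..n-1 (illustration only).
std::map<Id, BigT> makeMap(int n) {
  std::map<Id, BigT> m;
  for (int i = 0; i < n; ++i) m.emplace(i, BigT{});
  return m;
}
```

With a 1,000-entry map and 3 keys, the first helper reserves room for at least 1,000 `BigT` objects while the second needs room for only 3.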
LLVM-Based Profiler
LLVM: Manipulate the IR
Sanitizers: Infrastructure
Memoro: Collecting and displaying data
Memoro +
Run-Time Overhead
Visualizer
Open Challenges
Overview
Source Code → Compile → Run → Analyze
Source Code: No modification
Compile, INSTRUMENTATION PASS (LLVM): Instrument loads/stores, Instrument intrinsics, Collect types
Run, RUN-TIME (COMPILER-RT): Intercept alloc/free, Intercept loads/stores, Intercept syscalls, Collect stats
Analyze, VISUALIZER (ELECTRON): Score AP, Guide exploration
Run-Time Overhead
1,000x slowdown due to Memoro's run-time
Run-Time Sampling
// First version: int sample_count = 0;
// Second version: thread-local, so threads do not share one counter.
THREADLOCAL int sample_count = 0;

void interceptLoadStore(…) {
  // Sample accesses
  if (sample_count++ % access_sampling_rate != 0) return;
  /* Process access... */
}
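The sampling check above can be tried in isolation. This is a minimal sketch under stated assumptions: the slide elides the real signature, so the `const void*` parameter, the `bool` return value, and the `processed` counter are hypothetical, and standard `thread_local` stands in for the `THREADLOCAL` macro.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins: the slide does not show where these live.
static int access_sampling_rate = 4;             // process 1 access out of 4
static thread_local int sample_count = 0;        // one counter per thread
static thread_local std::uint64_t processed = 0; // accesses actually processed

// Returns true when the access was sampled and processed.
bool interceptLoadStore(const void* addr) {
  // Sample accesses: skip all but every access_sampling_rate-th one.
  if (sample_count++ % access_sampling_rate != 0) return false;
  (void)addr;  // a real run-time would update allocation statistics here
  ++processed;
  return true;
}
```

With a rate of 4, calling the function 100 times on one thread processes 25 accesses. Because the counter is per thread, each thread samples independently and no counter is written by more than one thread.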
Power to the user!
MEMORO_OPTIONS="…" ./myapp
  - access_sampling_rate
  - ...

// Public API: memoro_interface.h
#include <memoro_interface.h>
void foo(…) {
  MemoroFlags *mflags = memoro::getFlags();
  mflags->access_sampling_rate = 50;
  /* ... */
}
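How an option string could be turned into a flag value is not shown on the slide. The sketch below assumes a `name=value` syntax with `:` as separator; that syntax, `SketchFlags`, and `parseOptions` are all assumptions made for illustration and are not Memoro's actual parser. Only the flag name `access_sampling_rate` comes from the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <string>

// Hypothetical flag holder; only access_sampling_rate is named on the slide.
struct SketchFlags {
  int access_sampling_rate = 1;
};

// Parses "name=value" items separated by ':' (an assumed syntax).
// Unknown names and malformed or non-positive values are ignored.
SketchFlags parseOptions(const std::string& options) {
  SketchFlags flags;
  std::size_t pos = 0;
  while (pos <= options.size()) {
    std::size_t end = options.find(':', pos);
    if (end == std::string::npos) end = options.size();
    const std::string item = options.substr(pos, end - pos);
    const std::size_t eq = item.find('=');
    if (eq != std::string::npos &&
        item.substr(0, eq) == "access_sampling_rate") {
      try {
        const int rate = std::stoi(item.substr(eq + 1));
        if (rate > 0) flags.access_sampling_rate = rate;
      } catch (const std::exception&) {
        // Keep the default when the value is not a number.
      }
    }
    pos = end + 1;
  }
  return flags;
}
```

In this sketch a malformed value falls back to the default rate instead of aborting, which keeps a typo in the environment from stopping the profiled service.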
🕵 99%
[Figure: Time spent by address type, on a 0% to 100% axis. Categories: Primary Heap, Secondary Heap, Not Heap (Stack).]
🕵 The Allocators
ld 0x…
Primary: O(1)
Secondary (large allocations): O(n)
🔒
Metadata: Addr, Size, First Access Time, Access Range Low, …
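The cost difference between the two lookups can be sketched as follows. This is not the sanitizer allocator code: `PrimarySketch` and `SecondarySketch` are hypothetical simplifications, and only the `Metadata` field names are taken from the slide. The primary finds a chunk with arithmetic, while the secondary scans a list of large allocations while holding a lock.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Per-allocation metadata, using the field names shown on the slide.
struct Metadata {
  std::uintptr_t addr = 0;
  std::size_t size = 0;
  std::uint64_t first_access_time = 0;
  std::uintptr_t access_range_low = 0;
};

// Primary: one contiguous region cut into equal chunks, so the owning
// chunk is found with arithmetic alone: O(1).
class PrimarySketch {
 public:
  PrimarySketch(std::uintptr_t base, std::size_t chunk_size, std::size_t chunks)
      : base_(base), chunk_size_(chunk_size), meta_(chunks) {
    for (std::size_t i = 0; i < chunks; ++i) {
      meta_[i].addr = base + i * chunk_size;
      meta_[i].size = chunk_size;
    }
  }
  Metadata* find(std::uintptr_t p) {
    if (p < base_ || p >= base_ + chunk_size_ * meta_.size()) return nullptr;
    return &meta_[(p - base_) / chunk_size_];
  }

 private:
  std::uintptr_t base_;
  std::size_t chunk_size_;
  std::vector<Metadata> meta_;
};

// Secondary: large allocations kept in a list scanned under a lock, so a
// lookup that misses visits all n entries before giving up.
class SecondarySketch {
 public:
  void add(std::uintptr_t addr, std::size_t size) {
    std::lock_guard<std::mutex> guard(lock_);
    Metadata m;
    m.addr = addr;
    m.size = size;
    large_.push_back(m);
  }
  Metadata* find(std::uintptr_t p) {
    std::lock_guard<std::mutex> guard(lock_);
    for (Metadata& m : large_)
      if (p >= m.addr && p < m.addr + m.size) return &m;
    return nullptr;
  }

 private:
  std::mutex lock_;
  std::vector<Metadata> large_;
};
```

In this sketch an address that belongs to neither allocator is the worst case for the secondary: the whole list is traversed with the lock held, only to return nothing.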
🕵 Issue with non-heap addresses
1. Allocators only know about heap
2. Traverse all allocations to discard them
3. Takes a global lock
[Diagram: address space with Stack, …, Heap. Chart: time spent by address type, 0% to 100%, for Primary Heap, Secondary Heap, Stack.]
Run-Time Filter
1. Thread start: store stack top
2. Get current stack bottom
3. Discard if Addr. in this range
[Diagram: address space with Stack, …, Heap. Stack top 0xABCD, current stack bottom 0xAAAA; example addresses 0xAABB and 0x1234.]
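The three steps can be sketched in a few lines. The code below is an illustration under assumptions, not Memoro's implementation: `recordStackTop`, `isOnStack`, and the `SKETCH_NOINLINE` macro are hypothetical names, the stack is assumed to grow toward lower addresses, and the address of a local variable is used as an approximation of the current stack bottom.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Keep the probe in its own frame on GCC/Clang; elsewhere this is a no-op.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_NOINLINE __attribute__((noinline))
#else
#define SKETCH_NOINLINE
#endif

// Step 1: highest stack address recorded for the current thread.
static thread_local std::uintptr_t stack_top = 0;

// Call once at thread start with the address of a local variable that
// lives in the thread's outermost frame.
void recordStackTop(const void* frame_marker) {
  stack_top = reinterpret_cast<std::uintptr_t>(frame_marker);
}

// Returns true when addr lies in the live part of this thread's stack, so
// the access can be discarded without asking the allocators.
SKETCH_NOINLINE bool isOnStack(const void* addr) {
  // Step 2: a local in this frame approximates the current stack bottom
  // (assumes the stack grows toward lower addresses).
  volatile char here = 0;
  const std::uintptr_t bottom = reinterpret_cast<std::uintptr_t>(&here);
  const std::uintptr_t a = reinterpret_cast<std::uintptr_t>(addr);
  // Step 3: discard if the address is in [bottom, top].
  return a >= bottom && a <= stack_top;
}
```

The check is two comparisons on thread-local data, so it takes no lock and never touches the allocators' metadata.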
[Figure: Time spent by address type, on a 0% to 100% axis. Categories: Primary Heap, Secondary Heap, Not heap, Stack Filtered. Annotations: 99%, <2%, 0.58%.]
1,000x slowdown due to Memoro's run-time
5x slowdown due to Memoro's run-time
Visualizer
+ 100,000 Stack Traces
+ 1B Allocations
Truncate
[Figure: Score (0% to 40%) against Bin Size (100, 300, 1k, 3k, 10k, 30k, 100k), with part of the chart marked HIDE.]
[Figure: BEFORE and AFTER]
Death by a thousand cuts
[Animation: two stack traces compared side by side, foo() vs. bar(), each drawn as a chain of frames (…) with main(); the last slide shows both traces headed by main(), one continuing to bar() and the other to foo().]
Demo
Open Challenges
Dumping Profile
Your regular service: AtExit()
Facebook service: AtExit(), lldb call AtExit()
a. Signal to dump (SIGPROF)
b. Ring buffer + Periodic write
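Option (a) can be sketched with a signal handler that only sets a flag, leaving the actual dump to normal code. This is a hedged illustration, not Memoro's implementation: `installDumpSignal`, `maybeDumpProfile`, and the counter standing in for the profile write are hypothetical. Only the choice of SIGPROF comes from the slide, and the sketch assumes a POSIX system where that signal is defined.

```cpp
#include <cassert>
#include <csignal>

// Set from the signal handler; the handler does nothing else, so it stays
// async-signal-safe.
static volatile std::sig_atomic_t dump_requested = 0;
static int dumps_written = 0;  // placeholder for real profile output

extern "C" void onDumpSignal(int) { dump_requested = 1; }

// Installs the handler for SIGPROF, the signal named on the slide.
void installDumpSignal() { std::signal(SIGPROF, onDumpSignal); }

// Called from normal (non-signal) context, for example once per iteration
// of a service loop. Returns true when a dump was performed.
bool maybeDumpProfile() {
  if (!dump_requested) return false;
  dump_requested = 0;
  ++dumps_written;  // a real run-time would write the profile here
  return true;
}
```

Deferring the write to `maybeDumpProfile` means the profile can be dumped from a long-running service without relying on process exit, at the cost of a delay until the next check.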
Compile-Time Stack Analysis
ld/st: llvm::GetUnderlyingObject()
[Figure: Instrumented load/store (0 to 90,000) and Ratio against GetUnderlyingObject(depth = X), for X = 0, 1, 2, 4, 8.]
[Diagram: ld/st, bar(), foo()]
Thank you!
Thierry Treyer, Performance & Capacity Intern
Mark Santaniello, Performance & Capacity Engineer
James Larus, EPFL IC School Dean
github.com/epfl-vlsc/memoro