UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science

Memory Management for High-Performance Applications
Emery Berger, University of Massachusetts Amherst
May 20, 2015
High-Performance Applications
Web servers, search engines, scientific codes
Written in C or C++
Run on one server box or a cluster of them
[Diagram: server boxes, each with multiple CPUs, RAM, and a RAID drive]
Software stack: compiler, runtime system, operating system, hardware
Needs support at every level; the emphasis here is the runtime system
New Applications, Old Memory Managers

Applications and hardware have changed:
Multiprocessors now commonplace
Object-oriented, multithreaded applications
Increased pressure on the memory manager (malloc, free)

But memory managers have not kept up:
Inadequate support for modern applications
Current Memory Managers Limit Scalability

[Chart: Runtime Performance: Speedup vs. Number of Processors (1 to 14), Ideal vs. Actual]
As we add processors, program slows down
Caused by heap contention
Larson server benchmark on 14-processor Sun
The Problem

Current memory managers are inadequate for high-performance applications on modern architectures:
they limit scalability and application performance.
This Talk

Building memory managers: the Heap Layers framework
Problems with current memory managers: contention, false sharing, space
Solution: Hoard, a provably scalable memory manager
Reap, an extended memory manager for servers
Implementing Memory Managers

Memory managers must be space efficient and very fast
Typically heavily optimized C code: hand-unrolled loops, macros, monolithic functions
Hard to write, reuse, or extend
Real Code: DLmalloc 2.7.2

#define chunksize(p)          ((p)->size & ~(SIZE_BITS))
#define next_chunk(p)         ((mchunkptr)( ((char*)(p)) + ((p)->size & ~PREV_INUSE) ))
#define prev_chunk(p)         ((mchunkptr)( ((char*)(p)) - ((p)->prev_size) ))
#define chunk_at_offset(p, s) ((mchunkptr)(((char*)(p)) + (s)))
#define inuse(p) \
  ((((mchunkptr)(((char*)(p))+((p)->size & ~PREV_INUSE)))->size) & PREV_INUSE)
#define set_inuse(p) \
  ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size |= PREV_INUSE
#define clear_inuse(p) \
  ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size &= ~(PREV_INUSE)
#define inuse_bit_at_offset(p, s) \
  (((mchunkptr)(((char*)(p)) + (s)))->size & PREV_INUSE)
#define set_inuse_bit_at_offset(p, s) \
  (((mchunkptr)(((char*)(p)) + (s)))->size |= PREV_INUSE)
#define MALLOC_ZERO(charp, nbytes) \
do { \
  INTERNAL_SIZE_T* mzp = (INTERNAL_SIZE_T*)(charp); \
  CHUNK_SIZE_T mctmp = (nbytes)/sizeof(INTERNAL_SIZE_T); \
  long mcn; \
  if (mctmp < 8) mcn = 0; else { mcn = (mctmp-1)/8; mctmp %= 8; } \
  switch (mctmp) { \
    case 0: for(;;) { *mzp++ = 0; \
    case 7: *mzp++ = 0; \
    case 6: *mzp++ = 0; \
    case 5: *mzp++ = 0; \
    case 4: *mzp++ = 0; \
    case 3: *mzp++ = 0; \
    case 2: *mzp++ = 0; \
    case 1: *mzp++ = 0; if(mcn <= 0) break; mcn--; } \
  } \
} while(0)
Programming Language Support

Classes: overhead, rigid hierarchy
Mixins: no overhead, flexible hierarchy
Sounds great...
A Heap Layer

A heap layer is a C++ mixin with malloc and free methods:

template <class SuperHeap>
class GreenHeapLayer :
  public SuperHeap {...};

[Diagram: a GreenHeapLayer stacked on a RedHeapLayer]
Example: Thread-Safe Heap Layer

LockedHeap: protects the superheap with a lock

[Diagram: LockedMallocHeap = LockedHeap layered over mallocHeap]
Empirical Results

Heap Layers vs. the originals: KingsleyHeap vs. the BSD allocator, LeaHeap vs. DLmalloc 2.7
Competitive runtime and memory efficiency

[Chart: Runtime normalized to the Lea allocator for cfrac, espresso, lindsay, LRUsim, perl, roboop, and their average; series: Kingsley, KingsleyHeap, Lea, LeaHeap]
[Chart: Space normalized to the Lea allocator, same benchmarks and series]
Overview

Building memory managers: the Heap Layers framework
Problems with memory managers: contention, space, false sharing
Solution: Hoard, a provably scalable allocator
Reap, an extended memory manager for servers
Problems with General-Purpose Memory Managers

Previous work for multiprocessors:
Concurrent single heap [Bigler et al. 85, Johnson 91, Iyengar 92]: impractical
Multiple heaps [Larson 98, Gloger 99]: reduce contention but, as we show, cause other problems:
P-fold or even unbounded increase in space
Allocator-induced false sharing
Multiple Heap Allocator: Pure Private Heaps

One heap per processor:
malloc gets memory from its local heap
free puts memory on its local heap
Used by STL, Cilk, and ad hoc allocators

[Example trace: processor 0 runs x1 = malloc(1), x2 = malloc(1), free(x1), free(x2); processor 1 runs x3 = malloc(1), x4 = malloc(1), free(x3), free(x4). Key: in use by processor 0; free, on heap 1]
Problem: Unbounded Memory Consumption

Producer-consumer pattern:
Processor 0 allocates; processor 1 frees
Freed memory piles up on heap 1 while heap 0 keeps growing
Unbounded memory consumption: crash!

[Example trace: processor 0 runs x1 = malloc(1), x2 = malloc(1), x3 = malloc(1), ...; processor 1 runs free(x1), free(x2), free(x3), ...]
Multiple Heap Allocator: Private Heaps with Ownership

free returns memory to the original (owning) heap
Bounded memory consumption: no crash!
Used by “Ptmalloc” (Linux) and LKmalloc

[Example trace: x1 and x2 are allocated and freed across processors 0 and 1; each free returns memory to the allocating processor's heap]
Problem: P-fold Memory Blowup

Occurs in practice: round-robin producer-consumer
Processor i mod P allocates; processor (i+1) mod P frees
Freed memory returns to a heap that is never the next allocator, so every heap grows to the full footprint
Footprint = 1 (e.g., 2GB), but space = P (here 3, i.e., 6GB)
Can exceed the 32-bit address space: crash!

[Example trace on 3 processors: x1 = malloc(1) on processor 0, freed on processor 1; x2 = malloc(1) on processor 1, freed on processor 2; x3 = malloc(1) on processor 2, freed on processor 0]
Problem: Allocator-Induced False Sharing

False sharing: non-shared objects end up on the same cache line
Bane of parallel applications; extensively studied
All of these allocators cause false sharing!

[Diagram: processor 0 runs x1 = malloc(1), processor 1 runs x2 = malloc(1); x1 and x2 share a cache line, so the two CPUs' caches thrash over the bus]
So What Do We Do Now?

Where do we put free memory?
On a central heap: heap contention
On our own heap (pure private heaps): unbounded memory consumption
On the original heap (private heaps with ownership): P-fold blowup
And how do we avoid false sharing?
Overview

Building memory managers: the Heap Layers framework
Problems with memory managers: contention, space, false sharing
Solution: Hoard, a provably scalable allocator
Reap, an extended memory manager for servers
Hoard: Key Insights

Bound local memory consumption:
Explicitly track utilization
Move free memory to a global heap
Provably bounds memory consumption

Manage memory in large chunks:
Avoids false sharing
Reduces heap contention
Overview of Hoard

Manage memory in page-sized heap blocks: avoids false sharing
Allocate from the local heap block: avoids heap contention
On low utilization, move the heap block to the global heap: avoids space blowup

[Diagram: per-processor heaps (processor 0 through processor P-1) beneath a shared global heap]
Summary of Analytical Results

Space consumption: near-optimal worst case
Hoard: O(n log M/m + P), where P « n
Optimal: O(n log M/m) [Robson 70]
Private heaps with ownership: O(P n log M/m)
Provably low synchronization

(n = memory required, M = largest object size, m = smallest object size, P = number of processors)
Empirical Results

Measured runtime on a 14-processor Sun
Allocators: Solaris (the system allocator), Ptmalloc (GNU libc), mtmalloc (Sun's “MT-hot” allocator)
Micro-benchmarks:
Threadtest: no sharing
Larson: sharing (server-style)
Cache-scratch: mostly reads & writes (tests for false sharing)
Real-application experience is similar
Runtime Performance: threadtest

speedup(x, P) = runtime(Solaris allocator on one processor) / runtime(x on P processors)
Many threads, no sharing
Hoard achieves linear speedup
Runtime Performance: Larson

Many threads, sharing (server-style)
Hoard achieves linear speedup
Runtime Performance: false sharing

Many threads, mostly reads & writes of heap data
Hoard achieves linear speedup
Hoard in the “Real World”

Open source code: www.hoard.org
13,000 downloads; runs on Solaris, Linux, Windows, IRIX, …
Widely used in industry: AOL, British Telecom, Novell, Philips
Reported 2x-10x, “impressive” improvements in performance
Deployments: search server, telecom billing systems, scene rendering, real-time messaging middleware, text-to-speech engine, telephony, JVM
A scalable general-purpose memory manager
Overview

Building memory managers: the Heap Layers framework
Problems with memory managers: contention, space, false sharing
Solution: Hoard, a provably scalable allocator
Reap, an extended memory manager for servers
Custom Memory Allocation

Very common practice: Apache, gcc, lcc, STL, database servers, …
Language-level support in C++: replace new/delete, bypassing the general-purpose allocator
Reduce runtime: often
Expand functionality: sometimes
Reduce space: rarely
“Use custom allocators” is the standard advice
Runtime - Custom Allocator Benchmarks

[Chart: Normalized Runtime for 197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc, apache, lcc, mudlle, with non-regions, regions, and overall averages; series: Custom, Win32, DLmalloc]

The Reality

The Lea allocator is often just as fast or faster
Custom allocation is ineffective, except for regions [OOPSLA 2002]
Overview of Regions

regioncreate(r), regionmalloc(r, sz), regiondelete(r): separate areas, deletion only en masse

+ Fast: pointer-bumping allocation, deletion of whole chunks
+ Convenient: one call frees all memory
- Risky: accidental deletion
- Too much space
Why Regions?

Apparently faster and more space-efficient
Servers need memory management support:
Avoid resource leaks
Tear down memory associated with terminated connections or transactions
Current approach (e.g., Apache): regions
Drawbacks of Regions

Can't reclaim memory within regions
A problem for long-running computations, producer-consumer patterns, and off-the-shelf “malloc/free” programs: unbounded memory consumption
Current situation for Apache:
vulnerable to denial of service
limits runtime of connections
limits module programming
Reap Hybrid Allocator

Reap = region + heap: adds individual object deletion and a heap
API: reapcreate(r), reapmalloc(r, sz), reapfree(r, p), reapdelete(r)
Fast, with cheap deletion
Adapts to use (region style or heap style)
Can reduce memory consumption
Using Reap as Regions

[Chart: Runtime - Region-Based Benchmarks: Normalized Runtime for lcc and mudlle; series: Original, Win32, DLmalloc, WinHeap, Vmalloc, Reap; one bar is cut off at 4.08]
Reap performance nearly matches regions
Reap: Best of Both Worlds

Combining new/delete with regions is usually impossible: incompatible APIs, hard to rewrite code
With Reap: incorporated new/delete code into Apache's “mod_bc” (an arbitrary-precision calculator)
Changed 20 lines (out of 8000)
Benchmark: compute the 1000th prime
With Reap: 240K; without Reap: 7.4MB
Summary

Building memory managers: the Heap Layers framework [PLDI 2001]
Problems with current memory managers: contention, false sharing, space
Solution: Hoard, a provably scalable memory manager [ASPLOS-IX]
Reap, an extended memory manager for servers [OOPSLA 2002]
Current Projects

CRAMM: Cooperative Robust Automatic Memory Management
Garbage collection without paging; automatic heap sizing
SAVMM: Scheduler-Aware Virtual Memory Management
Markov: a programming language for building high-performance servers
COLA: Customizable Object Layout Algorithms; improving locality in Java
www.cs.umass.edu/~plasma
Looking Forward

“New” programming languages: increasing use of Java means garbage collection
New architectures: NUMA, SMT/CMP (“hyperthreading”)
Technology trends: the memory hierarchy
The Ever-Steeper Memory Hierarchy

Higher = smaller, faster, closer to the CPU. A real desktop machine (mine):
Registers: 8 integer, 8 floating-point; 1-cycle latency
L1 cache: 8K data & instructions; 2-cycle latency
L2 cache: 512K; 7-cycle latency
RAM: 1GB; 100-cycle latency
Disk: 40 GB; 38,000,000-cycle latency (!)
Swapping & Throughput

Once the heap exceeds available memory, throughput plummets
Why Manage Memory At All?

Just buy more! If the workload fits in RAM, there is no more swapping, and memory management is simplified
Sounds great…
But you still have to collect garbage eventually…
Memory Prices Over Time

[Chart: RAM Prices Over Time (1977 dollars): dollars per GB, 1977 to 2005, falling steadily across conventional DRAM generations from 2K through 8M]

“Soon it will be free…”
Memory Prices: Inflection Point!

[Chart: the same RAM price curve, extended with 512M and 1G SDRAM, RDRAM, DDR, and Chipkill parts; the price decline flattens at an inflection point]
Memory Is Actually Expensive

Desktops: most ship with 256MB; 1GB costs 50% more (70% for laptops, when possible); limited capacity
Servers: “buy 4GB, get 1 CPU free!”; on a Sun Enterprise 10000, an extra 8GB = $150,000
Fast RAM requires new technologies; cosmic rays…
(8GB of Sun RAM = 1 Ferrari Modena)
Key Problem: Paging

Garbage collectors are VM-oblivious: GC disrupts the LRU queue and touches non-resident pages
Virtual memory managers are GC-oblivious: likely to evict pages needed by the GC
Paging costs orders of magnitude more time than RAM: a big hit in performance and long pauses
Cooperative Robust Automatic Memory Management (CRAMM)

Joint work: Eliot Moss (UMass), Scott Kaplan (Amherst College)

[Diagram: a cooperative application's garbage collector and the virtual memory manager exchange information. Coarse-grained (heap-level): the VM tracks per-process and overall memory utilization and signals changes in memory pressure; the GC adjusts its heap size. Fine-grained (page-level): the VM's page replacement selects victim pages and sends page eviction notifications; the GC evacuates those pages.]
Fine-Grained Cooperative GC

Goal: GC triggers no additional paging
Key ideas:
Adapt the collection strategy on the fly
Page-oriented memory management
Exploit detailed page information from the VM

[Diagram: fine-grained loop in which the VM's page replacement selects victim pages and notifies the GC, which evacuates them]
Summary

Building memory managers: the Heap Layers framework
Problems with memory managers: contention, space, false sharing
Solution: Hoard, a provably scalable allocator
Future directions
If You Have to Spend $$...

More memory: bad. More Ferraris: good.
www.cs.umass.edu/~emery/plasma
This Page Intentionally Left Blank
Virtual Memory Manager Support

A new VM is required: detailed page-level information
“Segmented queue” (unprotected and protected segments) for low overhead
Local LRU order per process, not global LRU (as in Linux)
Complementary to the SAVM work: “Scheduler-Aware Virtual Memory manager”
Under development as a modified Linux kernel
Current Work: Robust Performance

Currently there is no VM-GC communication: bad interactions under memory pressure
Our approach (with Eliot Moss, Scott Kaplan): Cooperative Robust Automatic Memory Management

[Diagram: the virtual memory manager signals memory pressure from its LRU queue to the garbage collector/allocator, which returns empty pages, reducing paging impact]
Current Work: Predictable VMM

Recent work on scheduling for QoS (e.g., proportional-share), but under memory pressure the VMM effectively becomes the scheduler:
Paged-out processes may never recover
Intermittent processes may wait a long time
Scheduler-faithful virtual memory (with Scott Kaplan, Prashant Shenoy): based on page value rather than page order
Conclusion

Memory management for high-performance applications:
Heap Layers framework [PLDI 2001]: reusable components, no runtime cost
Hoard scalable memory manager [ASPLOS-IX]: high performance, provably scalable & space-efficient
Reap hybrid memory manager [OOPSLA 2002]: speed & robustness for server applications
Current work: robust memory management for multiprogramming
The Obligatory URL Slide
http://www.cs.umass.edu/~emery
If You Can Read This,I Went Too Far
Hoard: Under the Hood

[Diagram of Hoard's layered design:
MallocOrFreeHeap and SelectSizeHeap select a heap based on request size;
small objects go to a PerProcessorHeap (LockedHeap over HeapBlockManager): malloc from the local heap, free to the owning heap block (FreeToHeapBlock);
heap blocks come from a SuperblockHeap (LockedHeap over HeapBlockManager) that gets memory from, and returns memory to, the global EmptyHeapBlocks (LockedHeap over HeapBlockManager);
large objects (> 4K) go directly to the SystemHeap]
Custom Memory Allocation

Very common practice: Apache, gcc, lcc, STL, database servers, …
Language-level support in C++: replace new/delete, bypassing the general-purpose allocator
Reduce runtime: often
Expand functionality: sometimes
Reduce space: rarely
“Use custom allocators” is the standard advice
Drawbacks of Custom Allocators

Avoiding the memory manager means more code to maintain & debug, and memory debuggers can't be used
Not modular or robust: mix memory from custom and general-purpose allocators → crash!
Increased burden on programmers
Overview

Introduction: perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps: a generalization of regions & heaps
(I) Per-Class Allocators

Recycle freed objects of a class from a free list:

a = new Class1;
b = new Class1;
c = new Class1;
delete a;
delete b;
delete c;
a = new Class1;
b = new Class1;
c = new Class1;

+ Fast: simple linked-list operations
+ Identical semantics; C++ language support
- Possibly space-inefficient
(II) Custom Patterns

Tailor-made to fit allocation patterns
Example: 197.parser (a natural language parser) allocates from a fixed char[MEMORY_LIMIT] array:

a = xalloc(8);
b = xalloc(16);
c = xalloc(8);
xfree(b);
xfree(c);
d = xalloc(8);

+ Fast: pointer-bumping allocation
- Brittle: fixed memory size, requires stack-like lifetimes
(III) Regions

regioncreate(r), regionmalloc(r, sz), regiondelete(r): separate areas, deletion only en masse

+ Fast: pointer-bumping allocation, deletion of whole chunks
+ Convenient: one call frees all memory
- Risky: accidental deletion
- Too much space
Overview

Introduction: perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps: a generalization of regions & heaps
Custom Allocators Are Faster…

[Chart: Runtime - Custom Allocator Benchmarks: Normalized Runtime for 197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc, apache, lcc, mudlle, with non-regions, regions, and overall averages; series: Custom, Win32]
Not So Fast…

[Chart: the same benchmarks with DLmalloc added; series: Custom, Win32, DLmalloc]
The Lea Allocator (DLmalloc 2.7.0)

Optimized for common allocation patterns
Per-size quicklists ≈ per-class allocation
Deferred coalescing (combining adjacent free objects)
Highly optimized fastpath; space-efficient
Space Consumption Results

[Chart: Space - Custom Allocator Benchmarks: Normalized Space, with non-regions, regions, and averages; series: Original, DLmalloc]
Overview

Introduction: perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps: a generalization of regions & heaps
Why Regions?

Apparently faster and more space-efficient
Servers need memory management support:
Avoid resource leaks
Tear down memory associated with terminated connections or transactions
Current approach (e.g., Apache): regions
Drawbacks of Regions

Can't reclaim memory within regions
A problem for long-running computations, producer-consumer patterns, and off-the-shelf “malloc/free” programs: unbounded memory consumption
Current situation for Apache:
vulnerable to denial of service
limits runtime of connections
limits module programming
Reap Hybrid Allocator

Reap = region + heap: adds individual object deletion and a heap
API: reapcreate(r), reapmalloc(r, sz), reapfree(r, p), reapdelete(r)
+ Fast, with cheap deletion
+ Adapts to use (region or heap style)
Can reduce memory consumption
Using Reap as Regions

[Chart: Runtime - Region-Based Benchmarks: Normalized Runtime for lcc and mudlle; series: Original, Win32, DLmalloc, WinHeap, Vmalloc, Reap; one bar is cut off at 4.08]
Reap performance nearly matches regions
Reap: Best of Both Worlds

Combining new/delete with regions is usually impossible: incompatible APIs, hard to rewrite code
With Reap: incorporated new/delete code into Apache's “mod_bc” (an arbitrary-precision calculator)
Changed 20 lines (out of 8000)
Benchmark: compute the 1000th prime
With Reap: 240K; without Reap: 7.4MB
Conclusion

Empirical study of custom allocators:
The Lea allocator is often just as fast or faster
Custom allocation is ineffective, except for regions
Reaps nearly match region performance without the other drawbacks
Take-home message: stop using custom memory allocators!
Software
http://www.cs.umass.edu/~emery
(part of Heap Layers distribution)
UUNIVERSITY OF NIVERSITY OF MMASSACHUSETTSASSACHUSETTS A AMHERST • MHERST • Department of Computer Science Department of Computer Science 83
Experimental Methodology
Comparing to general-purpose allocators
Same semantics: no problem
E.g., disable per-class allocators
Different semantics: use emulator
Uses general-purpose allocator but adds bookkeeping
regionfree: free all associated objects
Other functionality (nesting, obstacks)
[Slide 84]
Use Custom Allocators?
Strongly recommended by practitioners
Little hard data on performance/space improvements
Only one previous study [Zorn 1992]
Focused on just one type of allocator
Custom allocators: waste of time
Small gains, bad allocators
Different allocators better? Trade-offs?
[Slide 85]
Kinds of Custom Allocators
Three basic types of custom allocators
Per-class
Fast
Custom patterns
Fast, but very special-purpose
Regions
Fast, possibly more space-efficient
Convenient
Variants: nested, obstacks
[Slide 86]
Optimization Opportunity
[Chart: “Time Spent in Memory Operations”: % of runtime spent in memory operations vs. other work]
[Slide 87]
[Slide 88]
Custom Memory Allocation
Programmers often replace malloc/free
Attempt to increase performance
Provide extra functionality (e.g., for servers)
Reduce space (rarely)
Empirical study of custom allocators:
Lea allocator often as fast or faster
Custom allocation ineffective, except for regions [OOPSLA 2002]
[Slide 89]
Overview of Regions
Separate areas, deletion only en masse
regioncreate(r), regionmalloc(r, sz), regiondelete(r)
+ Fast
+ Pointer-bumping allocation
+ Deletion of chunks
+ Convenient
+ One call frees all memory
- Risky
- Accidental deletion
- Too much space
[Slide 90]
Why Regions?
Apparently faster, more space-efficient
Servers need memory management support:
Avoid resource leaks
Tear down memory associated with terminated connections or transactions
Current approach (e.g., Apache): regions
[Slide 91]
Drawbacks of Regions
Can’t reclaim memory within regions
Problem for long-running computations, producer-consumer patterns, off-the-shelf “malloc/free” programs
Result: unbounded memory consumption
Current situation for Apache:
vulnerable to denial-of-service
limits runtime of connections
limits module programming