Memory Management for High-Performance Applications
Emery Berger
Department of Computer Science, University of Massachusetts, Amherst



Transcript
Page 1

Memory Management for High-Performance Applications
Emery Berger
University of Massachusetts, Amherst

Page 2

High-Performance Applications

Web servers, search engines, scientific codes
Written in C or C++ (still…)
Run on one or a cluster of server boxes

[Figure: server boxes, each with multiple CPUs, RAM, and a RAID drive, above the software stack: compiler, runtime system, operating system, hardware]

Needs support at every level

Page 3

New Applications, Old Memory Managers

Applications and hardware have changed:
Multiprocessors are now commonplace
Programs are object-oriented and multithreaded
Increased pressure on the memory manager (malloc, free)

But memory managers have not kept up: inadequate support for modern applications

Page 4

Current Memory Managers Limit Scalability

[Figure: runtime performance. Speedup vs. number of processors (1-14), ideal vs. actual: the actual curve collapses as processors are added]

As we add processors, the program slows down, caused by heap contention.
(Larson server benchmark on a 14-processor Sun.)

Page 5

The Problem

Current memory managers are inadequate for high-performance applications on modern architectures: they limit scalability, application performance, and robustness.

Page 6

This Talk

Building memory managers: the Heap Layers framework [PLDI 2001]
Problems with current memory managers: contention, false sharing, space
Solution: a provably scalable memory manager: Hoard [ASPLOS-IX]
An extended memory manager for servers: Reap [OOPSLA 2002]

Page 7

Implementing Memory Managers

Memory managers must be space efficient and very fast.
Heavily optimized code: hand-unrolled loops, macros, monolithic functions.
Hard to write, reuse, or extend.

Page 8

Building Modular Memory Managers

Classes: overhead, rigid hierarchy
Mixins: no overhead, flexible hierarchy

Page 9

A Heap Layer

A heap layer is a mixin with malloc & free methods:

template <class SuperHeap>
class GreenHeapLayer : public SuperHeap {…};

Layers (e.g., GreenHeapLayer, RedHeapLayer) compose by inheriting from the layer below.
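A minimal sketch of the idiom (illustrative layer names, not the actual Heap Layers API): each layer is a class template that inherits from its template parameter and wraps its superheap's malloc and free.

#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Bottom layer: obtains memory from the system allocator.
class MallocHeap {
public:
    void* malloc(size_t sz) { return std::malloc(sz); }
    void free(void* ptr)    { std::free(ptr); }
};

// A mixin layer: logs each request, then defers to its superheap.
template <class SuperHeap>
class LogHeapLayer : public SuperHeap {
public:
    void* malloc(size_t sz) {
        std::printf("malloc(%zu)\n", sz);
        return SuperHeap::malloc(sz);
    }
};

int main() {
    LogHeapLayer<MallocHeap> heap;   // layers compose at compile time
    void* p = heap.malloc(16);
    heap.free(p);
}

Because composition happens through templates rather than virtual calls, the compiler can inline straight through the layers, which is how mixins avoid the overhead of class hierarchies.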

Page 10

Example: Thread-Safe Heap Layer

LockedHeap protects its superheap with a lock:

LockedMallocHeap = LockedHeap over mallocHeap
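A hedged sketch of such a locking layer, assuming C++11's std::mutex (the original Heap Layers code used its own portable lock classes):

#include <cstddef>
#include <cstdlib>
#include <mutex>

// Thread-safety layer: serializes access to its superheap with a lock.
template <class SuperHeap>
class LockedHeap : public SuperHeap {
public:
    void* malloc(size_t sz) {
        std::lock_guard<std::mutex> guard(lock_);
        return SuperHeap::malloc(sz);
    }
    void free(void* ptr) {
        std::lock_guard<std::mutex> guard(lock_);
        SuperHeap::free(ptr);
    }
private:
    std::mutex lock_;
};

class MallocHeap {
public:
    void* malloc(size_t sz) { return std::malloc(sz); }
    void free(void* ptr)    { std::free(ptr); }
};

// The slide's composition: a thread-safe heap over mallocHeap.
using LockedMallocHeap = LockedHeap<MallocHeap>;

int main() {
    LockedMallocHeap heap;
    void* p = heap.malloc(64);   // now safe to call from multiple threads
    heap.free(p);
}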

Page 11

Empirical Results

Heap Layers vs. the originals:
KingsleyHeap vs. the BSD allocator
LeaHeap vs. DLmalloc 2.7
Competitive runtime and memory efficiency.

[Figures: runtime and space normalized to the Lea allocator for cfrac, espresso, lindsay, LRUsim, perl, roboop, and their average; bars for Kingsley, KingsleyHeap, Lea, and LeaHeap]

Page 12

Overview

Building memory managers: the Heap Layers framework
Problems with memory managers: contention, space, false sharing
Solution: a provably scalable allocator: Hoard
An extended memory manager for servers: Reap

Page 13

Problems with General-Purpose Memory Managers

Previous work for multiprocessors:
Concurrent single heap [Bigler et al. 85, Johnson 91, Iyengar 92]: impractical
Multiple heaps [Larson 98, Gloger 99]: reduce contention but, as we show, cause other problems:
P-fold or even unbounded increase in space
Allocator-induced false sharing

Page 14

Multiple Heap Allocator: Pure Private Heaps

One heap per processor:
malloc gets memory from its local heap
free puts memory on its local heap
Used by STL, Cilk, and ad hoc allocators

[Figure: allocation trace on processors 0 and 1: each free puts the object on the freeing processor's own heap, regardless of where it was allocated. Key: in use by processor 0; free, on heap 1]

Page 15

Problem: Unbounded Memory Consumption

Producer-consumer: processor 0 allocates, processor 1 frees.
Freed memory piles up on processor 1's heap, where processor 0 never reuses it: unbounded memory consumption. Crash! (A sketch of the pattern follows.)

[Figure: processor 0 repeatedly calls xi = malloc(1); processor 1 calls free(xi)]
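A sketch of this pattern in ordinary C++ (std::thread and the shared queue are scaffolding, not from the slide): every object is allocated on one thread and freed on the other, so with pure private heaps the freed memory accumulates on the consumer's heap, where the producer never reuses it.

#include <mutex>
#include <queue>
#include <thread>

std::queue<char*> q;   // passes ownership from producer to consumer
std::mutex m;

void producer() {                      // processor 0: only allocates
    for (int i = 0; i < 1000000; ++i) {
        char* x = new char[64];
        std::lock_guard<std::mutex> g(m);
        q.push(x);
    }
}

void consumer() {                      // processor 1: only frees
    for (int freed = 0; freed < 1000000; ) {
        std::lock_guard<std::mutex> g(m);
        if (!q.empty()) { delete[] q.front(); q.pop(); ++freed; }
    }
}

int main() {
    std::thread t0(producer), t1(consumer);
    t0.join();
    t1.join();
}

Under a pure-private-heaps allocator, the producer's heap keeps requesting fresh memory while the consumer's heap grows without bound.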

Page 16

Multiple Heap Allocator: Private Heaps with Ownership

free returns memory to its original heap.
Bounded memory consumption: no crash!
Used by Ptmalloc (Linux) and LKmalloc.

[Figure: processor 0's x1 and processor 1's x2 each return to the heap they were allocated from]

Page 17

Problem: P-fold Memory Blowup

Occurs in practice: round-robin producer-consumer.
Processor i mod P allocates; processor (i+1) mod P frees.
Since each free returns memory to a heap its next allocator never uses, every heap grows to the full footprint: footprint = 1 (2GB), but space = 3 (6GB).
Exceeds the 32-bit address space: crash!

[Figure: processors 0, 1, and 2 allocate x1, x2, x3 and free them round-robin on the next processor]

Page 18

Problem: Allocator-Induced False Sharing

False sharing: non-shared objects on the same cache line.
The bane of parallel applications; extensively studied.
All these allocators cause false sharing! (A sketch follows.)

[Figure: processor 0 calls x1 = malloc(1) and processor 1 calls x2 = malloc(1); both objects land on the same cache line, so the two CPUs' caches thrash across the bus]
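A sketch of how a program induces it (object sizes and iteration counts are arbitrary): if the allocator carves x1 and x2 from the same cache line, these two threads invalidate each other's caches on every write, even though they share no data.

#include <thread>

int main() {
    // Two small, logically independent heap objects; a naive allocator
    // may place both on one cache line.
    long* x1 = new long(0);
    long* x2 = new long(0);

    auto spin = [](long* p) {
        for (long i = 0; i < 100000000; ++i) *p += 1;  // write-heavy loop
    };
    std::thread t0(spin, x1), t1(spin, x2);  // one writer per processor
    t0.join();
    t1.join();
    delete x1;
    delete x2;
}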

Page 19

So What Do We Do Now?

Where do we put free memory?
On a central heap: heap contention
On our own heap (pure private heaps): unbounded memory consumption
On the original heap (private heaps with ownership): P-fold blowup

And how do we avoid false sharing?

Page 20

Overview

Building memory managers: the Heap Layers framework
Problems with memory managers: contention, space, false sharing
Solution: a provably scalable allocator: Hoard
An extended memory manager for servers: Reap

Page 21

Hoard: Key Insights

Bound local memory consumption:
Explicitly track utilization
Move free memory to a global heap
Provably bounds memory consumption

Manage memory in large chunks:
Avoids false sharing
Reduces heap contention

Page 22

Overview of Hoard

Manage memory in page-sized heap blocks: avoids false sharing.
Allocate from the local heap block: avoids heap contention.
When a heap block's utilization drops low, move it to the global heap: avoids space blowup. (A sketch of this rule follows.)

[Figure: per-processor heaps for processors 0 through P-1, with a shared global heap above them]
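A hedged sketch of the transfer rule (bookkeeping simplified from the ASPLOS paper, which tracks utilization per size class; the emptiness threshold here is illustrative):

#include <cstddef>
#include <list>

struct HeapBlock {               // page-sized; owned by one processor heap
    size_t inUse = 0;            // bytes currently allocated from it
    size_t capacity = 4096;
};

struct ProcessorHeap {
    std::list<HeapBlock*> blocks;
    size_t inUse = 0;            // totals across all blocks
    size_t capacity = 0;
};

std::list<HeapBlock*> globalHeap;

// Called after a free: if this processor's heap has become mostly empty,
// hand a completely free heap block to the global heap so that other
// processors can reuse it. This rule is what bounds the space blowup.
void maybeReleaseBlock(ProcessorHeap& h) {
    const double EMPTY_FRACTION = 0.25;   // illustrative threshold
    if (h.capacity == 0 || h.inUse >= EMPTY_FRACTION * h.capacity) return;
    for (auto it = h.blocks.begin(); it != h.blocks.end(); ++it) {
        if ((*it)->inUse == 0) {          // found a completely free block
            h.capacity -= (*it)->capacity;
            globalHeap.push_back(*it);
            h.blocks.erase(it);
            return;
        }
    }
}

int main() {
    ProcessorHeap h;
    h.blocks.push_back(new HeapBlock());
    h.capacity = 4096;            // one block, utilization zero
    maybeReleaseBlock(h);         // the block moves to the global heap
    return globalHeap.size() == 1 ? 0 : 1;
}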

Page 23

Summary of Analytical Results

Space consumption: near-optimal worst case.
Hoard: O(n log M/m + P), where P « n
Optimal: O(n log M/m) [Robson 70] (≈ bin-packing)
Private heaps with ownership: O(P n log M/m)
Provably low synchronization.

(n = memory required, M = largest object size, m = smallest object size, P = number of processors)

Page 24

Empirical Results

Measured runtime on a 14-processor Sun.
Allocators: Solaris (the system allocator), Ptmalloc (GNU libc), mtmalloc (Sun's “MT-hot” allocator).
Micro-benchmarks:
Threadtest: no sharing
Larson: sharing (server-style)
Cache-scratch: mostly reads & writes (tests for false sharing)
Real application experience is similar.

Page 25

Runtime Performance: threadtest

speedup(x, P) = runtime(Solaris allocator on one processor) / runtime(x on P processors)

Many threads, no sharing.
Hoard achieves linear speedup.

Page 26

Runtime Performance: Larson

Many threads, sharing (server-style).
Hoard achieves linear speedup.

Page 27

Runtime Performance: False Sharing

Many threads, mostly reads & writes of heap data.
Hoard achieves linear speedup.

Page 28

Hoard in the “Real World”

Open source: www.hoard.org; 13,000 downloads; runs on Solaris, Linux, Windows, IRIX, …
Widely used in industry: AOL, British Telecom, Novell, Philips.
Reported 2x-10x, “impressive” improvements in performance: search servers, telecom billing systems, scene rendering, real-time messaging middleware, text-to-speech engines, telephony, JVMs.
A scalable general-purpose memory manager.

Page 29

Overview

Building memory managers: the Heap Layers framework
Problems with memory managers: contention, space, false sharing
Solution: a provably scalable allocator: Hoard
An extended memory manager for servers: Reap

Page 30

Custom Memory Allocation

Programmers often replace malloc/free to:
Attempt to increase performance
Provide extra functionality (e.g., for servers)
Reduce space (rarely)

Our empirical study of custom allocators [OOPSLA 2002]: the Lea allocator is often as fast or faster; custom allocation is ineffective, except for regions.

Page 31

Overview of Regions

Separate areas, deletion only en masse:
regioncreate(r)
regionmalloc(r, sz)
regiondelete(r)

+ Fast: pointer-bumping allocation; deletion of whole chunks
+ Convenient: one call frees all memory
- Risky: accidental deletion; too much space

(A minimal sketch of the idiom follows.)
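A minimal sketch using the slide's API names (regioncreate, regionmalloc, regiondelete; the chunk size and alignment are arbitrary choices):

#include <cstddef>
#include <cstdlib>
#include <vector>

struct Region {
    std::vector<void*> chunks;   // every chunk, freed en masse
    char* bump = nullptr;        // next free byte in the current chunk
    size_t left = 0;             // bytes left in the current chunk
};

void regioncreate(Region& r) { r = Region(); }

void* regionmalloc(Region& r, size_t sz) {
    sz = (sz + 7) & ~size_t(7);            // keep 8-byte alignment
    if (sz > r.left) {                     // current chunk exhausted
        size_t chunk = sz > 4096 ? sz : 4096;
        r.bump = static_cast<char*>(std::malloc(chunk));
        r.left = chunk;
        r.chunks.push_back(r.bump);
    }
    void* p = r.bump;                      // pointer-bumping allocation
    r.bump += sz;
    r.left -= sz;
    return p;
}

void regiondelete(Region& r) {             // one call frees all memory
    for (void* c : r.chunks) std::free(c);
    r = Region();
}

int main() {
    Region r;
    regioncreate(r);
    void* a = regionmalloc(r, 100);
    void* b = regionmalloc(r, 5000);       // oversized: gets its own chunk
    (void)a; (void)b;
    regiondelete(r);
}

Note there is no per-object free at all: that is exactly the drawback the later slides discuss.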

Page 32

Why Regions?

Apparently faster and more space-efficient.
Servers need memory management support: avoid resource leaks; tear down memory associated with terminated connections or transactions.
The current approach (e.g., Apache): regions.

Page 33

Drawbacks of Regions

Memory within a region cannot be reclaimed: a problem for long-running computations, producer-consumer patterns, and off-the-shelf “malloc/free” programs → unbounded memory consumption.
The current situation for Apache: vulnerable to denial-of-service; limits the runtime of connections; limits module programming.

Page 34

Reap Hybrid Allocator

Reap = region + heap: adds individual object deletion and heap behavior.
reapcreate(r)
reapmalloc(r, sz)
reapfree(r, p)
reapdelete(r)

Fast, with cheap deletion; adapts to use (region or heap style); can reduce memory consumption. (A toy illustration follows.)
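A toy illustration with the slide's API names (the real Reap is built from Heap Layers and handles mixed sizes; fixed-size objects keep this sketch short, so sz must be at most OBJ here):

#include <cstddef>
#include <cstdlib>
#include <vector>

const size_t OBJ = 64, CHUNK = 4096;   // fixed object & chunk sizes

struct Reap {
    std::vector<void*> chunks;     // region-style chunks, freed en masse
    std::vector<void*> freeList;   // objects returned by reapfree
    char* bump = nullptr;
    size_t left = 0;
};

void reapcreate(Reap& r) { r = Reap(); }

void* reapmalloc(Reap& r, size_t sz) {
    (void)sz;                             // sketch: every object is OBJ bytes
    if (!r.freeList.empty()) {            // heap style: recycle a freed object
        void* p = r.freeList.back();
        r.freeList.pop_back();
        return p;
    }
    if (r.left < OBJ) {                   // region style: bump through chunks
        r.bump = static_cast<char*>(std::malloc(CHUNK));
        r.left = CHUNK;
        r.chunks.push_back(r.bump);
    }
    void* p = r.bump;
    r.bump += OBJ;
    r.left -= OBJ;
    return p;
}

void reapfree(Reap& r, void* p) { r.freeList.push_back(p); }  // per-object

void reapdelete(Reap& r) {                // one call still frees everything
    for (void* c : r.chunks) std::free(c);
    r = Reap();
}

int main() {
    Reap r;
    reapcreate(r);
    void* p = reapmalloc(r, 32);
    reapfree(r, p);                 // individual deletion: no blowup
    void* q = reapmalloc(r, 32);    // recycles p's memory
    (void)q;
    reapdelete(r);
}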

Page 35

Using Reap as Regions

[Figure: normalized runtime on the region-based benchmarks lcc and mudlle for Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap; one bar is cut off at 4.08]

Reap performance nearly matches regions

Page 36

Reap: Best of Both Worlds

Combining new/delete with regions is usually impossible: incompatible APIs; hard to rewrite code.
Using Reap, we incorporated new/delete code into Apache: “mod_bc” (an arbitrary-precision calculator), changing 20 lines (out of 8000).
Benchmark: compute the 1000th prime.
With Reap: 240K. Without Reap: 7.4MB.

Page 37

Open Questions

A Grand Unified Memory Manager? Hoard + Reap; integration with garbage collection
Effective custom allocators? Exploit sizes, lifetimes, locality, and sharing
Challenges of newer architectures: NUMA, SMT/CMP, 64-bit, predication

Page 38

Current Work: Robust Performance

Currently there is no VM-GC communication, causing BAD interactions under memory pressure.
Our approach (with Eliot Moss, Scott Kaplan): Cooperative Robust Automatic Memory Management.

[Figure: the garbage collector/allocator and the virtual memory manager cooperate: the VMM signals memory pressure from its LRU queue, the collector hands back empty pages, and paging impact is reduced]

Page 39

Current Work: Predictable VMM

Recent work on scheduling for QoS (e.g., proportional-share) breaks down under memory pressure, where the VMM effectively becomes the scheduler: paged-out processes may never recover, and intermittent processes may wait a long time.
Scheduler-faithful virtual memory (with Scott Kaplan, Prashant Shenoy): based on page value rather than page order.

Page 40

Conclusion

Memory management for high-performance applications:
Heap Layers framework [PLDI 2001]: reusable components, no runtime cost
Hoard scalable memory manager [ASPLOS-IX]: high-performance, provably scalable & space-efficient
Reap hybrid memory manager [OOPSLA 2002]: provides speed & robustness for server applications
Current work: robust memory management for multiprogramming

Page 41

The Obligatory URL Slide

http://www.cs.umass.edu/~emery

Page 42

If You Can Read This, I Went Too Far

Page 43

Hoard: Under the Hood

[Figure: Hoard's layered design. A MallocOrFreeHeap sits atop a SelectSizeHeap, which routes requests by size: large objects (> 4K) go to a locked SystemHeap, while small objects go to a PerProcessorHeap (a LockedHeap over a HeapBlockManager) that mallocs from the local heap and frees to the owning heap block (FreeToHeapBlock). Empty heap blocks live in their own LockedHeap/HeapBlockManager layer, and a locked SuperblockHeap gets memory from, and returns it to, the global heap. A sketch of the size-routing layer follows.]
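To make the size routing concrete, a hedged sketch of a size-selecting layer in the heap-layers spirit (the names, the 4K threshold, and the one-word ownership header are illustrative; the real Hoard source differs in detail):

#include <cstddef>
#include <cstdlib>

class MallocHeap {
public:
    void* malloc(size_t sz) { return std::malloc(sz); }
    void free(void* ptr)    { std::free(ptr); }
};

// Routes large requests to BigHeap and small ones to SmallHeap. A
// one-word header records the owner so free can route the object back.
template <class SmallHeap, class BigHeap, size_t Threshold = 4096>
class SelectSizeHeap {
public:
    void* malloc(size_t sz) {
        bool big = sz > Threshold;
        size_t* p = static_cast<size_t*>(
            big ? big_.malloc(sz + sizeof(size_t))
                : small_.malloc(sz + sizeof(size_t)));
        *p = big;          // remember which subheap owns this object
        return p + 1;
    }
    void free(void* ptr) {
        size_t* p = static_cast<size_t*>(ptr) - 1;
        if (*p) big_.free(p); else small_.free(p);
    }
private:
    SmallHeap small_;
    BigHeap   big_;
};

int main() {
    SelectSizeHeap<MallocHeap, MallocHeap> heap;
    void* s = heap.malloc(64);     // small path
    void* l = heap.malloc(8192);   // large path (> 4K)
    heap.free(s);
    heap.free(l);
}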

Page 44

Custom Memory Allocation

A very common practice: Apache, gcc, lcc, STL, database servers…; C++ has language-level support.
Replace new/delete, bypassing the general-purpose allocator:
Reduce runtime: often
Expand functionality: sometimes
Reduce space: rarely
The standard advice: “Use custom allocators.”

Page 45

Drawbacks of Custom Allocators

Avoiding the memory manager means more code to maintain & debug, and memory debuggers can't be used.
Not modular or robust: mix memory from custom and general-purpose allocators → crash!
An increased burden on programmers.

Page 46

Overview

Introduction: perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps: a generalization of regions & heaps

Page 47

(I) Per-Class Allocators

Recycle freed objects of a class from a free list:

a = new Class1; b = new Class1; c = new Class1;
delete a; delete b; delete c;
a = new Class1; b = new Class1; c = new Class1;

+ Fast: linked-list operations
+ Simple: identical semantics; C++ language support
- Possibly space-inefficient

(A sketch follows.)
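A hedged sketch of the technique via C++'s class-level operator new/delete (Class1 and its payload are illustrative; this version is not thread-safe):

#include <cstddef>
#include <new>

class Class1 {
public:
    static void* operator new(size_t sz) {
        if (freeList) {                     // recycle from the free list
            void* p = freeList;
            freeList = freeList->next;
            return p;
        }
        return ::operator new(sz);          // fall back to the heap
    }
    static void operator delete(void* p) {  // push onto the free list
        FreeNode* node = static_cast<FreeNode*>(p);
        node->next = freeList;
        freeList = node;
    }
private:
    struct FreeNode { FreeNode* next; };
    static FreeNode* freeList;
    char payload[32];                       // stand-in for real members
};

Class1::FreeNode* Class1::freeList = nullptr;

int main() {
    Class1* a = new Class1;
    delete a;
    Class1* b = new Class1;   // reuses a's memory from the free list
    delete b;
}

Freed objects are never returned to the general heap, which is why this approach can be space-inefficient.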

Page 48

(II) Custom Patterns

Tailor-made to fit allocation patterns.
Example: 197.parser (a natural language parser) bumps an end_of_array pointer through a fixed char[MEMORY_LIMIT]:

a = xalloc(8); b = xalloc(16); c = xalloc(8);
xfree(b); xfree(c);
d = xalloc(8);

+ Fast: pointer-bumping allocation
- Brittle: fixed memory size; requires stack-like lifetimes

(A sketch follows.)
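A hedged sketch of the xalloc/xfree pattern (the names come from the slide, but the xfree signature, the arena size, and the exact reclamation rule are assumptions; the real SPEC code differs in detail). Only an object sitting at the end of the array can be reclaimed here, which is why the allocator demands stack-like lifetimes.

#include <cstddef>

const size_t MEMORY_LIMIT = 1 << 20;
static char arena[MEMORY_LIMIT];   // the slide's char[MEMORY_LIMIT]
static size_t end_of_array = 0;    // bump pointer into the arena

void* xalloc(size_t sz) {
    void* p = &arena[end_of_array];
    end_of_array += sz;            // no bounds check: fixed memory size
    return p;
}

void xfree(void* p, size_t sz) {
    // Reclaim only if this object is the last one in the array.
    if (static_cast<char*>(p) + sz == &arena[end_of_array])
        end_of_array -= sz;
}

int main() {
    void* a = xalloc(8);
    void* b = xalloc(16);
    void* c = xalloc(8);
    xfree(c, 8);           // at the end: reclaimed
    xfree(b, 16);          // now at the end: reclaimed too
    void* d = xalloc(8);   // lands right after a, reusing the space
    (void)a; (void)d;
}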

Page 49

(III) Regions

Separate areas, deletion only en masse:
regioncreate(r)
regionmalloc(r, sz)
regiondelete(r)

+ Fast: pointer-bumping allocation; deletion of whole chunks
+ Convenient: one call frees all memory
- Risky: accidental deletion; too much space

Page 50

Overview

Introduction: perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps: a generalization of regions & heaps

Page 51

Custom Allocators Are Faster…

[Figure: normalized runtime on the custom allocator benchmarks (197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc, apache, lcc, mudlle, plus non-region, region, and overall averages), comparing each Custom allocator against Win32]

Page 52

Not So Fast…

[Figure: the same normalized-runtime chart with DLmalloc added alongside Custom and Win32]

Page 53

The Lea Allocator (DLmalloc 2.7.0)

Optimized for common allocation patterns:
Per-size quicklists (≈ per-class allocation)
Deferred coalescing (combining adjacent free objects)
A highly optimized fast path
Space-efficient

(A sketch of the quicklist idea follows.)
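A hedged sketch of the quicklist idea only (DLmalloc's real bins, trimming, and coalescing policy are far more involved): freed objects go onto a per-size list, and the next matching request pops them straight back, deferring any coalescing work.

#include <cstddef>
#include <cstdlib>

struct FreeNode { FreeNode* next; };

const size_t NUM_CLASSES = 32, GRAIN = 8;   // small classes, 8-byte granularity
static FreeNode* quicklist[NUM_CLASSES];

void* qmalloc(size_t sz) {
    size_t c = (sz + GRAIN - 1) / GRAIN;    // size class
    if (c < NUM_CLASSES && quicklist[c]) {  // fast path: pop the quicklist
        FreeNode* p = quicklist[c];
        quicklist[c] = p->next;
        return p;
    }
    return std::malloc(c * GRAIN);          // slow path: general heap
}

void qfree(void* ptr, size_t sz) {
    size_t c = (sz + GRAIN - 1) / GRAIN;
    if (c < NUM_CLASSES) {                  // defer coalescing: just push
        FreeNode* p = static_cast<FreeNode*>(ptr);
        p->next = quicklist[c];
        quicklist[c] = p;
    } else {
        std::free(ptr);
    }
}

int main() {
    void* p = qmalloc(24);
    qfree(p, 24);           // onto the 24-byte class's quicklist
    void* q = qmalloc(24);  // fast path: recycled without touching malloc
    (void)q;
}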

Page 54

Space Consumption Results

[Figure: normalized space on the custom allocator benchmarks (non-region and region averages), comparing Original against DLmalloc]

Page 55

Overview

Introduction: perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps: a generalization of regions & heaps

Page 56

Why Regions?

Apparently faster and more space-efficient.
Servers need memory management support: avoid resource leaks; tear down memory associated with terminated connections or transactions.
The current approach (e.g., Apache): regions.

Page 57

Drawbacks of Regions

Memory within a region cannot be reclaimed: a problem for long-running computations, producer-consumer patterns, and off-the-shelf “malloc/free” programs → unbounded memory consumption.
The current situation for Apache: vulnerable to denial-of-service; limits the runtime of connections; limits module programming.

Page 58

Reap Hybrid Allocator

Reap = region + heap: adds individual object deletion and heap behavior.
reapcreate(r)
reapmalloc(r, sz)
reapfree(r, p)
reapdelete(r)

+ Fast, with cheap deletion
+ Adapts to use (region or heap style)
+ Can reduce memory consumption

Page 59

Using Reap as Regions

[Figure: normalized runtime on the region-based benchmarks lcc and mudlle for Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap; one bar is cut off at 4.08]

Reap performance nearly matches regions

Page 60

Reap: Best of Both Worlds

Combining new/delete with regions is usually impossible: incompatible APIs; hard to rewrite code.
Using Reap, we incorporated new/delete code into Apache: “mod_bc” (an arbitrary-precision calculator), changing 20 lines (out of 8000).
Benchmark: compute the 1000th prime.
With Reap: 240K. Without Reap: 7.4MB.

Page 61

Conclusion

Our empirical study of custom allocators: the Lea allocator is often as fast or faster; custom allocation is ineffective, except for regions.
Reaps nearly match region performance without the other drawbacks.
Take-home message: stop using custom memory allocators!

Page 62

Software

http://www.cs.umass.edu/~emery

(part of the Heap Layers distribution)