
Memory Management for High-Performance Applications

May 20, 2015


Emery Berger

Fast and effective memory management is crucial for many applications, including web servers, database managers, and scientific codes. However, current memory managers do not provide adequate support for these applications on modern architectures, severely limiting their performance, scalability, and robustness.

In this talk, I describe how to design memory managers that support high-performance applications. I first address the software engineering challenges of building efficient memory managers. I then show how current general-purpose memory managers do not scale on multiprocessors, cause false sharing of heap objects, and systematically leak memory. I describe a fast, provably scalable general-purpose memory manager called Hoard (available at www.hoard.org) that solves these problems, improving performance by up to a factor of 60.
Transcript
Page 1: Memory Management for High-Performance Applications

Memory Management for High-Performance Applications
Emery Berger, University of Massachusetts Amherst

Page 2: Memory Management for High-Performance Applications

High-Performance Applications

Web servers, search engines, scientific codes
Written in C or C++
Run on one or a cluster of server boxes

[Diagram: server boxes, each with four CPUs, RAM, and a RAID drive; software stack: compiler, runtime system, operating system, hardware]

Needs support at every level, especially the runtime system

Page 3: Memory Management for High-Performance Applications

New Applications, Old Memory Managers

Applications and hardware have changed:
Multiprocessors now commonplace
Object-oriented, multithreaded
Increased pressure on the memory manager (malloc, free)

But memory managers have not kept up:
Inadequate support for modern applications

Page 4: Memory Management for High-Performance Applications

Current Memory Managers Limit Scalability

[Chart: speedup vs. number of processors (1 to 14); ideal linear speedup vs. actual]

As we add processors, the program slows down
Caused by heap contention

Larson server benchmark on a 14-processor Sun

Page 5: Memory Management for High-Performance Applications

The Problem

Current memory managers are inadequate for high-performance applications on modern architectures.
They limit scalability and application performance.

Page 6: Memory Management for High-Performance Applications

This Talk

Building memory managers: the Heap Layers framework
Problems with current memory managers: contention, false sharing, space
Solution: Hoard, a provably scalable memory manager
Extended memory manager for servers: Reap

Page 7: Memory Management for High-Performance Applications

Implementing Memory Managers

Memory managers must be space efficient and very fast

Typically heavily-optimized C code:
hand-unrolled loops, macros, monolithic functions

Hard to write, reuse, or extend

Page 8: Memory Management for High-Performance Applications

Real Code: DLmalloc 2.7.2

#define chunksize(p)          ((p)->size & ~(SIZE_BITS))
#define next_chunk(p)         ((mchunkptr)( ((char*)(p)) + ((p)->size & ~PREV_INUSE) ))
#define prev_chunk(p)         ((mchunkptr)( ((char*)(p)) - ((p)->prev_size) ))
#define chunk_at_offset(p, s) ((mchunkptr)(((char*)(p)) + (s)))
#define inuse(p) \
  ((((mchunkptr)(((char*)(p))+((p)->size & ~PREV_INUSE)))->size) & PREV_INUSE)
#define set_inuse(p) \
  ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size |= PREV_INUSE
#define clear_inuse(p) \
  ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size &= ~(PREV_INUSE)
#define inuse_bit_at_offset(p, s) \
  (((mchunkptr)(((char*)(p)) + (s)))->size & PREV_INUSE)
#define set_inuse_bit_at_offset(p, s) \
  (((mchunkptr)(((char*)(p)) + (s)))->size |= PREV_INUSE)
#define MALLOC_ZERO(charp, nbytes) \
do { \
  INTERNAL_SIZE_T* mzp = (INTERNAL_SIZE_T*)(charp); \
  CHUNK_SIZE_T mctmp = (nbytes)/sizeof(INTERNAL_SIZE_T); \
  long mcn; \
  if (mctmp < 8) mcn = 0; else { mcn = (mctmp-1)/8; mctmp %= 8; } \
  switch (mctmp) { \
    case 0: for(;;) { *mzp++ = 0; \
    case 7: *mzp++ = 0; \
    case 6: *mzp++ = 0; \
    case 5: *mzp++ = 0; \
    case 4: *mzp++ = 0; \
    case 3: *mzp++ = 0; \
    case 2: *mzp++ = 0; \
    case 1: *mzp++ = 0; if (mcn <= 0) break; mcn--; } \
  } \
} while (0)

Page 9: Memory Management for High-Performance Applications

Programming Language Support

Classes: overhead, rigid hierarchy
Mixins: no overhead, flexible hierarchy. Sounds great...

Page 10: Memory Management for High-Performance Applications

A Heap Layer

A C++ mixin with malloc & free methods:

template <class SuperHeap>
class GreenHeapLayer :
  public SuperHeap {...};

[Diagram: GreenHeapLayer layered over RedHeapLayer]

Page 11: Memory Management for High-Performance Applications

Example: Thread-Safe Heap Layer

LockedHeap protects the superheap with a lock

[Diagram: LockedMallocHeap = LockedHeap layered over mallocHeap]
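As a minimal sketch of the idiom, assuming a simplified interface (the real Heap Layers framework supplies its own lock and malloc layers; std::mutex and std::malloc stand in here):

#include <cstdlib>
#include <mutex>

// Bottom layer: gets memory from the system allocator (stand-in for mallocHeap).
class MallocHeap {
public:
  void * malloc (size_t sz) { return std::malloc (sz); }
  void free (void * ptr) { std::free (ptr); }
};

// LockedHeap: a mixin that protects its superheap with a lock.
template <class SuperHeap>
class LockedHeap : public SuperHeap {
public:
  void * malloc (size_t sz) {
    std::lock_guard<std::mutex> guard (_lock);
    return SuperHeap::malloc (sz);
  }
  void free (void * ptr) {
    std::lock_guard<std::mutex> guard (_lock);
    SuperHeap::free (ptr);
  }
private:
  std::mutex _lock;
};

// Compose at compile time: a thread-safe heap.
typedef LockedHeap<MallocHeap> LockedMallocHeap;

Because the composition is resolved at compile time, the compiler can inline straight through the layers; that is how mixins avoid the overhead of a rigid class hierarchy.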

Page 12: Memory Management for High-Performance Applications

Empirical Results

Heap Layers vs. the originals:
KingsleyHeap vs. the BSD allocator
LeaHeap vs. DLmalloc 2.7

Competitive runtime and memory efficiency

[Chart: runtime normalized to the Lea allocator for Kingsley, KingsleyHeap, Lea, and LeaHeap on cfrac, espresso, lindsay, LRUsim, perl, roboop, and their average]

[Chart: space normalized to the Lea allocator, same allocators and benchmarks]

Page 13: Memory Management for High-Performance Applications

Overview

Building memory managers: the Heap Layers framework
Problems with memory managers: contention, space, false sharing
Solution: Hoard, a provably scalable allocator
Extended memory manager for servers: Reap

Page 14: Memory Management for High-Performance Applications

Problems with General-Purpose Memory Managers

Previous work for multiprocessors:

Concurrent single heap [Bigler et al. 85, Johnson 91, Iyengar 92]: impractical

Multiple heaps [Larson 98, Gloger 99]: reduce contention but, as we show, cause other problems:
P-fold or even unbounded increase in space
Allocator-induced false sharing

Page 15: Memory Management for High-Performance Applications

Multiple Heap Allocator: Pure Private Heaps

One heap per processor:
malloc gets memory from its local heap
free puts memory on its local heap

Used by STL, Cilk, and ad hoc allocators

[Animation: processor 0 allocates and frees x1 and x2 on heap 0; processor 1 allocates and frees x3 and x4 on heap 1. Key: in use by processor 0; free, on heap 1]
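A minimal sketch of this policy, assuming a single size class and thread-local heaps (illustrative, not any real allocator's code):

#include <cstdlib>

// One private heap per thread: a free list of blocks of one size class.
struct PrivateHeap {
  struct Block { Block * next; };
  Block * freeList = nullptr;

  void * malloc (size_t sz) {
    if (freeList != nullptr) {       // reuse a locally-freed block
      Block * b = freeList;
      freeList = b->next;
      return b;
    }
    return std::malloc (sz < sizeof(Block) ? sizeof(Block) : sz);
  }

  // Pure private heaps: freed memory goes on the *caller's* heap,
  // regardless of which heap originally supplied it.
  void free (void * ptr) {
    Block * b = static_cast<Block *> (ptr);
    b->next = freeList;
    freeList = b;
  }
};

thread_local PrivateHeap myHeap;     // each thread mallocs and frees here

The next slide's producer-consumer problem follows directly: if thread 0 only mallocs and thread 1 only frees, thread 1's freeList grows without bound while thread 0 keeps asking the system for fresh memory.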

Page 16: Memory Management for High-Performance Applications

Problem: Unbounded Memory Consumption

Producer-consumer:
Processor 0 allocates; processor 1 frees

Unbounded memory consumption: crash!

[Animation: processor 0 runs x1 = malloc(1), x2 = malloc(1), x3 = malloc(1); processor 1 runs free(x1), free(x2), free(x3); the freed memory piles up on heap 1, unusable by processor 0]

Page 17: Memory Management for High-Performance Applications

Multiple Heap Allocator: Private Heaps with Ownership

free returns memory to its original heap

Bounded memory consumption: no crash!

Used by "Ptmalloc" (Linux) and LKmalloc

[Animation: processor 0 runs x1 = malloc(1), free(x1); processor 1 runs x2 = malloc(1), free(x2)]

Page 18: Memory Management for High-Performance Applications

Problem: P-fold Memory Blowup

Occurs in practice: round-robin producer-consumer
processor i mod P allocates
processor (i+1) mod P frees

Each processor's heap ends up holding a full footprint of freed memory, so footprint = 1 (2GB) but space = 3 (6GB) with three processors.
Exceeds the 32-bit address space: crash!

[Animation: processors 0, 1, and 2 each allocate a block (x1, x2, x3); each block is freed by the next processor round-robin, so freed memory accumulates on all three heaps]

Page 19: Memory Management for High-Performance Applications

Problem: Allocator-Induced False Sharing

False sharing: non-shared objects on the same cache line
The bane of parallel applications; extensively studied

All of these allocators cause false sharing!

[Diagram: processor 0 runs x1 = malloc(1) and processor 1 runs x2 = malloc(1); x1 and x2 land on the same cache line, so CPU 0 and CPU 1 thrash the line back and forth across the bus]
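A minimal demonstration of the effect, with assumed object placement (whether two consecutive small allocations share a cache line depends on the allocator, which is exactly the point):

#include <thread>

int main() {
  // Two small "private" objects; a multiple-heap allocator may carve them
  // from adjacent addresses, putting both on one cache line.
  long * x1 = new long (0);   // only thread 0 writes this
  long * x2 = new long (0);   // only thread 1 writes this

  // No object is shared, yet if x1 and x2 share a cache line, each write
  // invalidates the other CPU's copy and the line ping-pongs across the bus.
  std::thread t0 ([&] { for (int i = 0; i < 100000000; ++i) ++*x1; });
  std::thread t1 ([&] { for (int i = 0; i < 100000000; ++i) ++*x2; });
  t0.join();
  t1.join();

  delete x1;
  delete x2;
  return 0;
}

Padding each object out to a full cache line (e.g., with alignas(64)) makes the slowdown disappear, confirming false sharing rather than true contention.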

Page 20: Memory Management for High-Performance Applications

So What Do We Do Now?

Where do we put free memory?
on a central heap: heap contention
on our own heap (pure private heaps): unbounded memory consumption
on the original heap (private heaps with ownership): P-fold blowup

And how do we avoid false sharing?

Page 21: Memory Management for High-Performance Applications

Overview

Building memory managers: the Heap Layers framework
Problems with memory managers: contention, space, false sharing
Solution: Hoard, a provably scalable allocator
Extended memory manager for servers: Reap

Page 22: Memory Management for High-Performance Applications

Hoard: Key Insights

Bound local memory consumption:
explicitly track utilization
move free memory to a global heap
provably bounds memory consumption

Manage memory in large chunks:
avoids false sharing
reduces heap contention

Page 23: Memory Management for High-Performance Applications

Overview of Hoard

Manage memory in page-sized heap blocks: avoids false sharing
Allocate from the local heap block: avoids heap contention
On low utilization, move the heap block to the global heap: avoids space blowup

[Diagram: per-processor heaps (processor 0 through processor P-1) above a shared global heap]
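A minimal sketch of the low-utilization rule, with an assumed emptiness threshold and per-block bookkeeping (Hoard's real invariant tracks whole-heap utilization and uses locked superblock lists):

#include <cstddef>
#include <vector>

// A page-sized heap block: tracks how many of its objects are live.
struct HeapBlock {
  size_t inUse = 0;
  size_t capacity = 0;
  // ... object storage elided ...
};

struct GlobalHeap {
  std::vector<HeapBlock *> blocks;   // lock-protected in a real allocator
  void give (HeapBlock * b) { blocks.push_back (b); }
};

struct PerProcessorHeap {
  GlobalHeap * global;

  // On free, if the block's utilization drops below a threshold, hand the
  // whole block to the global heap so other processors can reuse it.
  // This is what bounds the space blowup.
  void onFree (HeapBlock * b) {
    --b->inUse;
    const double EMPTINESS = 0.25;   // assumed threshold
    if (b->inUse < EMPTINESS * b->capacity) {
      global->give (b);
    }
  }
};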

Page 24: Memory Management for High-Performance Applications

Summary of Analytical Results

Space consumption: near-optimal worst case
Hoard: O(n log M/m + P), with P << n
Optimal: O(n log M/m) [Robson 70]
Private heaps with ownership: O(P n log M/m)

Provably low synchronization

(n = memory required, M = largest object size, m = smallest object size, P = number of processors)

Page 25: Memory Management for High-Performance Applications

Empirical Results

Measured runtime on a 14-processor Sun

Allocators: Solaris (the system allocator), Ptmalloc (GNU libc), mtmalloc (Sun's "MT-hot" allocator)

Micro-benchmarks:
Threadtest: no sharing
Larson: sharing (server-style)
Cache-scratch: mostly reads & writes (tests for false sharing)

Real application experience is similar

Page 26: Memory Management for High-Performance Applications

Runtime Performance: threadtest

Many threads, no sharing

speedup(x, P) = runtime(Solaris allocator on one processor) / runtime(x on P processors)

Hoard achieves linear speedup

Page 27: Memory Management for High-Performance Applications

Runtime Performance: Larson

Many threads, sharing (server-style)

Hoard achieves linear speedup

Page 28: Memory Management for High-Performance Applications

Runtime Performance: false sharing

Many threads, mostly reads & writes of heap data

Hoard achieves linear speedup

Page 29: Memory Management for High-Performance Applications

Hoard in the "Real World"

Open source code: www.hoard.org
13,000 downloads
Solaris, Linux, Windows, IRIX, ...

Widely used in industry:
AOL, British Telecom, Novell, Philips
Reports of 2x-10x, "impressive" improvements in performance
Search servers, telecom billing systems, scene rendering, real-time messaging middleware, text-to-speech engines, telephony, JVMs

A scalable general-purpose memory manager

Page 30: Memory Management for High-Performance Applications

Overview

Building memory managers: the Heap Layers framework
Problems with memory managers: contention, space, false sharing
Solution: Hoard, a provably scalable allocator
Extended memory manager for servers: Reap

Page 31: Memory Management for High-Performance Applications

Custom Memory Allocation

Very common practice: Apache, gcc, lcc, STL, database servers...
Language-level support in C++: replace new/delete, bypassing the general-purpose allocator

Reduces runtime: often
Expands functionality: sometimes
Reduces space: rarely

The standard advice: "Use custom allocators"

Page 32: Memory Management for High-Performance Applications

Runtime - Custom Allocator Benchmarks

[Chart: normalized runtime for Custom, Win32, and DLmalloc allocators on 197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc, apache, lcc, and mudlle, grouped into non-regions, regions, and averages]

The Reality

The Lea allocator is often just as fast or faster
Custom allocation is ineffective, except for regions. [OOPSLA 2002]

Page 33: Memory Management for High-Performance Applications

Overview of Regions

Separate areas, deletion only en masse:
regioncreate(r)
regionmalloc(r, sz)
regiondelete(r)

+ Fast: pointer-bumping allocation, deletion of chunks
+ Convenient: one call frees all memory

- Risky: accidental deletion
- Too much space
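A minimal region allocator sketch using the API names above; the chunk size, alignment, and layout are assumptions:

#include <cstdlib>

// A region: a linked list of large chunks plus a bump pointer.
struct Chunk { Chunk * next; };

struct Region {
  Chunk * chunks = nullptr;   // everything allocated so far
  char * bump = nullptr;      // next free byte in the current chunk
  char * limit = nullptr;     // end of the current chunk
};

const size_t CHUNK_BYTES = 64 * 1024;   // assumed chunk size

void regioncreate (Region & r) { r = Region(); }

void * regionmalloc (Region & r, size_t sz) {
  sz = (sz + 7) & ~size_t(7);           // 8-byte alignment
  if (r.bump == nullptr || r.bump + sz > r.limit) {
    size_t bytes = (sz > CHUNK_BYTES) ? sz : CHUNK_BYTES;
    Chunk * c = (Chunk *) std::malloc (sizeof(Chunk) + bytes);
    c->next = r.chunks;                 // link in the new chunk
    r.chunks = c;
    r.bump = (char *) (c + 1);
    r.limit = r.bump + bytes;
  }
  void * p = r.bump;                    // pointer-bumping allocation
  r.bump += sz;
  return p;
}

// Deletion only en masse: one call frees every chunk in the region.
void regiondelete (Region & r) {
  Chunk * c = r.chunks;
  while (c != nullptr) {
    Chunk * next = c->next;
    std::free (c);
    c = next;
  }
  r = Region();
}

The speed comes from the bump in regionmalloc; the risk and the space cost come from regiondelete being the only way to give anything back.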

Page 34: Memory Management for High-Performance Applications

Why Regions?

Apparently faster and more space-efficient

Servers need memory management support:
avoid resource leaks
tear down memory associated with terminated connections or transactions

Current approach (e.g., Apache): regions

Page 35: Memory Management for High-Performance Applications

Drawbacks of Regions

Can't reclaim memory within regions:
a problem for long-running computations, producer-consumer patterns, and off-the-shelf "malloc/free" programs
leads to unbounded memory consumption

Current situation for Apache:
vulnerable to denial of service
limits the runtime of connections
limits module programming

Page 36: Memory Management for High-Performance Applications

Reap Hybrid Allocator

Reap = region + heap: adds individual object deletion and a heap

reapcreate(r)
reapmalloc(r, sz)
reapfree(r, p)
reapdelete(r)

Can reduce memory consumption
Fast; adapts to its use (region or heap style); cheap deletion
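A minimal sketch of the hybrid, assuming a single size class for brevity (the real Reap segregates sizes and is built from heap layers): region-style bump allocation, plus a free list so reapfree can recycle individual objects:

#include <cstdlib>

struct Reap {
  char * bump = nullptr, * limit = nullptr;
  void * freeList = nullptr;   // individually freed objects, for reuse
  void * chunks = nullptr;     // chunk list, released en masse
};

const size_t OBJ_BYTES = 64, REAP_CHUNK = 64 * 1024;   // assumed sizes

void reapcreate (Reap & r) { r = Reap(); }

void * reapmalloc (Reap & r, size_t sz) {
  (void) sz;                             // single size class assumed
  if (r.freeList != nullptr) {           // heap style: reuse a freed object
    void * p = r.freeList;
    r.freeList = * (void **) p;
    return p;
  }
  if (r.bump == nullptr || r.bump + OBJ_BYTES > r.limit) {
    void * c = std::malloc (sizeof(void *) + REAP_CHUNK);
    * (void **) c = r.chunks;            // link in the new chunk
    r.chunks = c;
    r.bump = (char *) c + sizeof(void *);
    r.limit = r.bump + REAP_CHUNK;
  }
  void * p = r.bump;                     // region style: pointer bump
  r.bump += OBJ_BYTES;
  return p;
}

void reapfree (Reap & r, void * p) {     // individual object deletion
  * (void **) p = r.freeList;
  r.freeList = p;
}

void reapdelete (Reap & r) {             // one call frees all memory
  void * c = r.chunks;
  while (c != nullptr) {
    void * next = * (void **) c;
    std::free (c);
    c = next;
  }
  r = Reap();
}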

Page 37: Memory Management for High-Performance Applications

Using Reap as Regions

[Chart: normalized runtime on the region-based benchmarks lcc and mudlle for Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap; one bar runs off the chart at 4.08]

Reap performance nearly matches regions

Page 38: Memory Management for High-Performance Applications

Reap: Best of Both Worlds

Combining new/delete with regions is usually impossible:
incompatible APIs; hard to rewrite code

Using Reap: incorporated new/delete code into the Apache "mod_bc" module (an arbitrary-precision calculator)
Changed 20 lines (out of 8,000)
Benchmark: compute the 1000th prime
With Reap: 240K. Without Reap: 7.4MB
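A hypothetical sketch of the kind of glue such a port needs (this is not the actual mod_bc patch; the names and the Reap type come from the sketch earlier): point the module's allocation calls at the current connection's reap, so a leak-prone module becomes safe to tear down:

// Hypothetical glue, not the actual mod_bc patch.
static Reap * currentReap;               // assumed to be set per connection

void * bc_malloc (size_t sz) { return reapmalloc (*currentReap, sz); }
void bc_free (void * p) { reapfree (*currentReap, p); }

// At connection teardown, everything the module still holds is reclaimed:
//   reapdelete (*currentReap);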

Page 39: Memory Management for High-Performance Applications

Summary

Building memory managers: the Heap Layers framework [PLDI 2001]
Problems with current memory managers: contention, false sharing, space
Solution: Hoard, a provably scalable memory manager [ASPLOS-IX]
Extended memory manager for servers: Reap [OOPSLA 2002]

Page 40: Memory Management for High-Performance Applications

Current Projects

CRAMM: Cooperative Robust Automatic Memory Management
Garbage collection without paging; automatic heap sizing

SAVMM: Scheduler-Aware Virtual Memory Management

Markov: a programming language for building high-performance servers

COLA: Customizable Object Layout Algorithms
Improving locality in Java

Page 41: Memory Management for High-Performance Applications


www.cs.umass.edu/~plasma


Page 43: Memory Management for High-Performance Applications

Looking Forward

"New" programming languages: increasing use of Java = garbage collection
New architectures: NUMA, SMT/CMP ("hyperthreading")
Technology trends: the memory hierarchy

Page 44: Memory Management for High-Performance Applications

The Ever-Steeper Memory Hierarchy

Higher = smaller, faster, closer to the CPU. A real desktop machine (mine):

registers: 8 integer, 8 floating-point; 1-cycle latency
L1 cache: 8K data & instructions; 2-cycle latency
L2 cache: 512K; 7-cycle latency
RAM: 1GB; 100-cycle latency
Disk: 40GB; 38,000,000-cycle latency (!)

Page 45: Memory Management for High-Performance Applications

Swapping & Throughput

Once the heap exceeds available memory, throughput plummets

Page 46: Memory Management for High-Performance Applications

Why Manage Memory At All?

Just buy more! Simplifies memory management
(still have to collect garbage eventually...)
Workload fits in RAM = no more swapping!

Sounds great...

Page 47: Memory Management for High-Performance Applications

Memory Prices Over Time

[Chart: RAM prices over time in 1977 dollars, from $10,000 down toward $0.01 per GB (log scale), 1977-2005, tracking conventional DRAM generations from 2K to 8M parts]

"Soon it will be free..."

Page 48: Memory Management for High-Performance Applications

Memory Prices: Inflection Point!

[Chart: the same RAM price curve, extended with 512M and 1G parts (SDRAM, RDRAM, DDR, Chipkill) alongside conventional DRAM; the curve shows an inflection point]

Page 49: Memory Management for High-Performance Applications

Memory Is Actually Expensive

Desktops:
most ship with 256MB
1GB = 50% more $$ (laptops: 70%, if even possible)
limited capacity

Servers:
buy 4GB, get 1 CPU free!
Sun Enterprise 10000: 8GB extra = $150,000 (8GB of Sun RAM = 1 Ferrari Modena)

Fast RAM: new technologies
Cosmic rays...

Page 50: Memory Management for High-Performance Applications

Key Problem: Paging

Garbage collectors are VM-oblivious:
GC disrupts the LRU queue, touches non-resident pages

Virtual memory managers are GC-oblivious:
likely to evict pages needed by the GC

Paging takes orders of magnitude more time than RAM:
a BIG hit in performance and LONG pauses

Page 51: Memory Management for High-Performance Applications

Cooperative Robust Automatic Memory Management (CRAMM)

Joint work: Eliot Moss (UMass), Scott Kaplan (Amherst College)

[Diagram: the garbage collector adjusts heap size (coarse-grained, heap-level) and evacuates pages; the virtual memory manager selects victim pages (fine-grained, page-level), tracks per-process and overall memory utilization, and performs page replacement. They exchange: change in memory pressure / new heap size; page eviction notification / victim page(s). "I'm a cooperative application!"]

Page 52: Memory Management for High-Performance Applications

Fine-Grained Cooperative GC

Goal: GC triggers no additional paging

Key ideas:
adapt the collection strategy on the fly
page-oriented memory management
exploit detailed page information from the VM

[Diagram: the garbage collector evacuates pages; the virtual memory manager selects victim pages (fine-grained) for page replacement, sending page eviction notifications and victim page(s)]

Page 53: Memory Management for High-Performance Applications

Summary

Building memory managers: the Heap Layers framework
Problems with memory managers: contention, space, false sharing
Solution: Hoard, a provably scalable allocator
Future directions

Page 54: Memory Management for High-Performance Applications

If You Have to Spend $$...

More memory: bad. More Ferraris: good.

Page 55: Memory Management for High-Performance Applications

www.cs.umass.edu/~emery/plasma

Page 56: Memory Management for High-Performance Applications


This Page Intentionally Left Blank

Page 57: Memory Management for High-Performance Applications

Virtual Memory Manager Support

New VM required: detailed page-level information
"Segmented queue" (an unprotected segment plus a protected segment) for low overhead
Local LRU ordering per process, not global LRU (as in Linux)

Complementary to the SAVM work: the "Scheduler-Aware Virtual Memory manager"
Under development: a modified Linux kernel

Page 58: Memory Management for High-Performance Applications

Current Work: Robust Performance

Currently: no VM-GC communication
BAD interactions under memory pressure

Our approach (with Eliot Moss, Scott Kaplan): Cooperative Robust Automatic Memory Management

[Diagram: the garbage collector/allocator and the virtual memory manager exchange information: memory pressure from the VM's LRU queue, empty pages from the collector; the result is reduced paging impact]

Page 59: Memory Management for High-Performance Applications

Current Work: Predictable VMM

Recent work on scheduling for QoS (e.g., proportional-share)
But under memory pressure, the VMM is the scheduler:
paged-out processes may never recover
intermittent processes may wait a long time

Scheduler-faithful virtual memory (with Scott Kaplan, Prashant Shenoy):
based on page value rather than page order

Page 60: Memory Management for High-Performance Applications

Conclusion

Memory management for high-performance applications:
Heap Layers framework [PLDI 2001]: reusable components, no runtime cost
Hoard scalable memory manager [ASPLOS-IX]: high performance, provably scalable & space-efficient
Reap hybrid memory manager [OOPSLA 2002]: speed & robustness for server applications

Current work: robust memory management for multiprogramming

Page 61: Memory Management for High-Performance Applications


The Obligatory URL Slide

http://www.cs.umass.edu/~emery

Page 62: Memory Management for High-Performance Applications


If You Can Read This,I Went Too Far

Page 63: Memory Management for High-Performance Applications

Hoard: Under the Hood

[Layer diagram: MallocOrFreeHeap on top of SelectSizeHeap, which selects a heap based on size; large objects (> 4K) go to the SystemHeap; small objects go to a PerProcessorHeap (LockedHeap over HeapBlockManager), which mallocs from the local heap and frees to the owning heap block (FreeToHeapBlock); a SuperblockHeap (LockedHeap) and the empty heap blocks (LockedHeap over HeapBlockManager) get memory from or return memory to the global heap]

Page 64: Memory Management for High-Performance Applications

Custom Memory Allocation

Very common practice: Apache, gcc, lcc, STL, database servers...
Language-level support in C++: replace new/delete, bypassing the general-purpose allocator

Reduces runtime: often
Expands functionality: sometimes
Reduces space: rarely

The standard advice: "Use custom allocators"

Page 65: Memory Management for High-Performance Applications

Drawbacks of Custom Allocators

Avoiding the memory manager means:
more code to maintain & debug
can't use memory debuggers
not modular or robust: mixing memory from custom and general-purpose allocators → crash!

An increased burden on programmers

Page 66: Memory Management for High-Performance Applications

Overview

Introduction: perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps: a generalization of regions & heaps

Page 67: Memory Management for High-Performance Applications

(I) Per-Class Allocators

Recycle freed objects of a given class from a free list:

a = new Class1;
b = new Class1;
c = new Class1;
delete a;
delete b;
delete c;
a = new Class1;
b = new Class1;
c = new Class1;

[Diagram: the Class1 free list holds a, b, and c after the deletes]

+ Fast: linked-list operations
+ Simple: identical semantics, C++ language support
- Possibly space-inefficient
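A minimal sketch of the technique, assuming single-threaded use; the class-level operator new/delete overloads are the C++ language support the slide refers to:

#include <cstdlib>

class Class1 {
public:
  // Per-class allocation: recycle freed instances from a free list.
  static void * operator new (size_t sz) {
    if (freeList != nullptr) {         // pop a recycled object
      void * p = freeList;
      freeList = * (void **) p;
      return p;
    }
    return std::malloc (sz);
  }
  static void operator delete (void * p) {
    * (void **) p = freeList;          // push onto the free list
    freeList = p;
  }
private:
  static void * freeList;
  char payload[32];                    // illustrative object state
};

void * Class1::freeList = nullptr;

The space cost is visible here: freed Class1 objects sit on this free list forever, unavailable to any other class or size.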

Page 68: Memory Management for High-Performance Applications

(II) Custom Patterns

Tailor-made to fit the allocation pattern
Example: 197.parser (a natural language parser) allocates from a fixed char[MEMORY_LIMIT] array:

a = xalloc(8);
b = xalloc(16);
c = xalloc(8);
xfree(b);
xfree(c);
d = xalloc(8);

[Diagram: a, b, and c allocated by bumping end_of_array; frees roll it back; d reuses the space]

+ Fast: pointer-bumping allocation
- Brittle: fixed memory size, requires stack-like lifetimes
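A minimal reconstruction of the pattern under the stated constraints (fixed array, stack-like lifetimes); the xalloc/xfree bodies are assumptions, not 197.parser's actual code:

#include <cstddef>

const size_t MEMORY_LIMIT = 1 << 20;   // assumed limit
static char memory[MEMORY_LIMIT];
static size_t end_of_array = 0;        // the bump pointer

// Allocation just bumps end_of_array: fast, but no overflow check (brittle).
void * xalloc (size_t sz) {
  void * p = &memory[end_of_array];
  end_of_array += sz;
  return p;
}

// Freeing rolls the pointer back to the freed object, releasing it and
// everything allocated after it: hence the stack-like lifetime requirement.
void xfree (void * p) {
  end_of_array = (char *) p - memory;
}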

Page 69: Memory Management for High-Performance Applications

(III) Regions

Separate areas, deletion only en masse:
regioncreate(r)
regionmalloc(r, sz)
regiondelete(r)

+ Fast: pointer-bumping allocation, deletion of chunks
+ Convenient: one call frees all memory

- Risky: accidental deletion
- Too much space

Page 70: Memory Management for High-Performance Applications

Overview

Introduction: perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps: a generalization of regions & heaps

Page 71: Memory Management for High-Performance Applications

Custom Allocators Are Faster…

[Chart: normalized runtime for Custom and Win32 allocators on 197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc, apache, lcc, and mudlle, grouped into non-regions, regions, and averages; the custom allocators appear faster]

Page 72: Memory Management for High-Performance Applications

Not So Fast…

[Chart: the same benchmarks with DLmalloc added; DLmalloc matches or beats the custom allocators except on the region-based benchmarks]

Page 73: Memory Management for High-Performance Applications

The Lea Allocator (DLmalloc 2.7.0)

Optimized for common allocation patterns:
per-size quicklists ≈ per-class allocation
deferred coalescing (combining adjacent free objects)
highly-optimized fastpath
space-efficient

Page 74: Memory Management for High-Performance Applications

Space Consumption Results

[Chart: space on the custom allocator benchmarks, normalized; bars for Original and DLmalloc, grouped into non-regions, regions, and averages]

Page 75: Memory Management for High-Performance Applications

Overview

Introduction: perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps: a generalization of regions & heaps

Page 76: Memory Management for High-Performance Applications

Why Regions?

Apparently faster and more space-efficient

Servers need memory management support:
avoid resource leaks
tear down memory associated with terminated connections or transactions

Current approach (e.g., Apache): regions

Page 77: Memory Management for High-Performance Applications

Drawbacks of Regions

Can't reclaim memory within regions:
a problem for long-running computations, producer-consumer patterns, and off-the-shelf "malloc/free" programs
leads to unbounded memory consumption

Current situation for Apache:
vulnerable to denial of service
limits the runtime of connections
limits module programming

Page 78: Memory Management for High-Performance Applications

Reap Hybrid Allocator

Reap = region + heap: adds individual object deletion and a heap

reapcreate(r)
reapmalloc(r, sz)
reapfree(r, p)
reapdelete(r)

Can reduce memory consumption
+ Fast
+ Adapts to its use (region or heap style)
+ Cheap deletion

Page 79: Memory Management for High-Performance Applications

Using Reap as Regions

[Chart: normalized runtime on the region-based benchmarks lcc and mudlle for Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap; one bar runs off the chart at 4.08]

Reap performance nearly matches regions

Page 80: Memory Management for High-Performance Applications

Reap: Best of Both Worlds

Combining new/delete with regions is usually impossible:
incompatible APIs; hard to rewrite code

Using Reap: incorporated new/delete code into the Apache "mod_bc" module (an arbitrary-precision calculator)
Changed 20 lines (out of 8,000)
Benchmark: compute the 1000th prime
With Reap: 240K. Without Reap: 7.4MB

Page 81: Memory Management for High-Performance Applications

Conclusion

Empirical study of custom allocators:
the Lea allocator is often as fast or faster
custom allocation is ineffective, except for regions

Reaps nearly match region performance without the other drawbacks

Take-home message: stop using custom memory allocators!

Page 82: Memory Management for High-Performance Applications


Software

http://www.cs.umass.edu/~emery

(part of Heap Layers distribution)

Page 83: Memory Management for High-Performance Applications

Experimental Methodology

Comparing to general-purpose allocators:

Same semantics: no problem (e.g., disable per-class allocators)

Different semantics: use an emulator
Uses the general-purpose allocator but adds bookkeeping
regionfree: frees all associated objects
Other functionality (nesting, obstacks)
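A minimal sketch of such an emulator (names are illustrative): it satisfies region allocation with the general-purpose allocator but records every object so that a region-wide free works:

#include <cstdlib>
#include <vector>

// Region emulator: same semantics as a region, but each object comes from
// the general-purpose allocator, plus bookkeeping.
struct EmulatedRegion {
  std::vector<void *> objects;   // the added bookkeeping
};

void * emuRegionMalloc (EmulatedRegion & r, size_t sz) {
  void * p = std::malloc (sz);   // general-purpose allocation
  r.objects.push_back (p);
  return p;
}

// regionfree semantics: free all associated objects.
void emuRegionFree (EmulatedRegion & r) {
  for (void * p : r.objects) {
    std::free (p);
  }
  r.objects.clear();
}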

Page 84: Memory Management for High-Performance Applications

Use Custom Allocators?

Strongly recommended by practitioners

But little hard data on performance/space improvements:
only one previous study [Zorn 1992], focused on just one type of allocator
its conclusion: custom allocators are a waste of time (small gains, bad allocators)
Are different allocators better? What are the trade-offs?

Page 85: Memory Management for High-Performance Applications

Kinds of Custom Allocators

Three basic types of custom allocators:
Per-class: fast
Custom patterns: fast, but very special-purpose
Regions: fast, possibly more space-efficient, convenient
Variants: nested regions, obstacks

Page 86: Memory Management for High-Performance Applications

Optimization Opportunity

[Chart: percentage of runtime spent in memory operations vs. other work, per benchmark, on a 0-100% scale]


Page 88: Memory Management for High-Performance Applications

Custom Memory Allocation

Programmers often replace malloc/free:
to attempt to increase performance
to provide extra functionality (e.g., for servers)
to reduce space (rarely)

Empirical study of custom allocators:
the Lea allocator is often as fast or faster
custom allocation is ineffective, except for regions. [OOPSLA 2002]

Page 89: Memory Management for High-Performance Applications

Overview of Regions

Separate areas, deletion only en masse:
regioncreate(r)
regionmalloc(r, sz)
regiondelete(r)

+ Fast: pointer-bumping allocation, deletion of chunks
+ Convenient: one call frees all memory

- Risky: accidental deletion
- Too much space

Page 90: Memory Management for High-Performance Applications

Why Regions?

Apparently faster and more space-efficient

Servers need memory management support:
avoid resource leaks
tear down memory associated with terminated connections or transactions

Current approach (e.g., Apache): regions

Page 91: Memory Management for High-Performance Applications

Drawbacks of Regions

Can't reclaim memory within regions:
a problem for long-running computations, producer-consumer patterns, and off-the-shelf "malloc/free" programs
leads to unbounded memory consumption

Current situation for Apache:
vulnerable to denial of service
limits the runtime of connections
limits module programming

Page 92: Memory Management for High-Performance Applications

Reap Hybrid Allocator

Reap = region + heap: adds individual object deletion and a heap

reapcreate(r)
reapmalloc(r, sz)
reapfree(r, p)
reapdelete(r)

Can reduce memory consumption
Fast; adapts to its use (region or heap style); cheap deletion

Page 93: Memory Management for High-Performance Applications

Using Reap as Regions

[Chart: normalized runtime on the region-based benchmarks lcc and mudlle for Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap; one bar runs off the chart at 4.08]

Reap performance nearly matches regions

Page 94: Memory Management for High-Performance Applications

Reap: Best of Both Worlds

Combining new/delete with regions is usually impossible:
incompatible APIs; hard to rewrite code

Using Reap: incorporated new/delete code into the Apache "mod_bc" module (an arbitrary-precision calculator)
Changed 20 lines (out of 8,000)
Benchmark: compute the 1000th prime
With Reap: 240K. Without Reap: 7.4MB