
Memory Management for High-Performance Applications

May 20, 2015


Emery Berger

Fast and effective memory management is crucial for many applications, including web servers, database managers, and scientific codes. However, current memory managers do not provide adequate support for these applications on modern architectures, severely limiting their performance, scalability, and robustness.

In this talk, I describe how to design memory managers that support high-performance applications. I first address the software engineering challenges of building efficient memory managers. I then show how current general-purpose memory managers do not scale on multiprocessors, cause false sharing of heap objects, and systematically leak memory. I describe a fast, provably scalable general-purpose memory manager called Hoard (available at www.hoard.org) that solves these problems, improving performance by up to a factor of 60.
Transcript
Page 1: Memory Management for High-Performance Applications

Memory Management for High-Performance Applications
Emery Berger, University of Massachusetts Amherst

Page 2: Memory Management for High-Performance Applications

High-Performance Applications

Web servers, search engines, scientific codes
Written in C or C++
Run on one or a cluster of server boxes

[Diagram: server boxes, each with four CPUs, RAM, and a RAID drive; software stack: compiler, runtime system, operating system, hardware]

Needs support at every level, especially the runtime system

Page 3: Memory Management for High-Performance Applications

New Applications, Old Memory Managers

Applications and hardware have changed:
Multiprocessors now commonplace
Object-oriented, multithreaded
Increased pressure on the memory manager (malloc, free)

But memory managers have not kept up:
Inadequate support for modern applications

Page 4: Memory Management for High-Performance Applications

Current Memory Managers Limit Scalability

[Chart: speedup vs. number of processors (1 to 14); ideal linear speedup vs. actual]

As we add processors, the program slows down
Caused by heap contention

Larson server benchmark on a 14-processor Sun

Page 5: Memory Management for High-Performance Applications

The Problem

Current memory managers are inadequate for high-performance applications on modern architectures.
They limit scalability and application performance.

Page 6: Memory Management for High-Performance Applications

This Talk

Building memory managers: the Heap Layers framework
Problems with current memory managers: contention, false sharing, space
Solution: Hoard, a provably scalable memory manager
Extended memory manager for servers: Reap

Page 7: Memory Management for High-Performance Applications

Implementing Memory Managers

Memory managers must be space efficient and very fast

Typically heavily-optimized C code:
hand-unrolled loops, macros, monolithic functions

Hard to write, reuse, or extend

Page 8: Memory Management for High-Performance Applications

Real Code: DLmalloc 2.7.2

#define chunksize(p)          ((p)->size & ~(SIZE_BITS))
#define next_chunk(p)         ((mchunkptr)( ((char*)(p)) + ((p)->size & ~PREV_INUSE) ))
#define prev_chunk(p)         ((mchunkptr)( ((char*)(p)) - ((p)->prev_size) ))
#define chunk_at_offset(p, s) ((mchunkptr)(((char*)(p)) + (s)))
#define inuse(p) \
  ((((mchunkptr)(((char*)(p))+((p)->size & ~PREV_INUSE)))->size) & PREV_INUSE)
#define set_inuse(p) \
  ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size |= PREV_INUSE
#define clear_inuse(p) \
  ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size &= ~(PREV_INUSE)
#define inuse_bit_at_offset(p, s) \
  (((mchunkptr)(((char*)(p)) + (s)))->size & PREV_INUSE)
#define set_inuse_bit_at_offset(p, s) \
  (((mchunkptr)(((char*)(p)) + (s)))->size |= PREV_INUSE)
#define MALLOC_ZERO(charp, nbytes) \
do { \
  INTERNAL_SIZE_T* mzp = (INTERNAL_SIZE_T*)(charp); \
  CHUNK_SIZE_T mctmp = (nbytes)/sizeof(INTERNAL_SIZE_T); \
  long mcn; \
  if (mctmp < 8) mcn = 0; else { mcn = (mctmp-1)/8; mctmp %= 8; } \
  switch (mctmp) { \
    case 0: for(;;) { *mzp++ = 0; \
    case 7: *mzp++ = 0; \
    case 6: *mzp++ = 0; \
    case 5: *mzp++ = 0; \
    case 4: *mzp++ = 0; \
    case 3: *mzp++ = 0; \
    case 2: *mzp++ = 0; \
    case 1: *mzp++ = 0; if (mcn <= 0) break; mcn--; } \
  } \
} while (0)

Page 9: Memory Management for High-Performance Applications

Programming Language Support

Classes: overhead, rigid hierarchy
Mixins: no overhead, flexible hierarchy. Sounds great...

Page 10: Memory Management for High-Performance Applications

A Heap Layer

A C++ mixin with malloc & free methods:

template <class SuperHeap>
class GreenHeapLayer :
  public SuperHeap {...};

[Diagram: GreenHeapLayer layered over RedHeapLayer]

Page 11: Memory Management for High-Performance Applications

Example: Thread-Safe Heap Layer

LockedHeap protects the superheap with a lock

[Diagram: LockedMallocHeap = LockedHeap layered over mallocHeap]
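As a minimal sketch of the idiom, assuming a simplified interface (the real Heap Layers framework supplies its own lock and malloc layers; std::mutex and std::malloc stand in here):

#include <cstdlib>
#include <mutex>

// Bottom layer: gets memory from the system allocator (stand-in for mallocHeap).
class MallocHeap {
public:
  void * malloc (size_t sz) { return std::malloc (sz); }
  void free (void * ptr) { std::free (ptr); }
};

// LockedHeap: a mixin that protects its superheap with a lock.
template <class SuperHeap>
class LockedHeap : public SuperHeap {
public:
  void * malloc (size_t sz) {
    std::lock_guard<std::mutex> guard (_lock);
    return SuperHeap::malloc (sz);
  }
  void free (void * ptr) {
    std::lock_guard<std::mutex> guard (_lock);
    SuperHeap::free (ptr);
  }
private:
  std::mutex _lock;
};

// Compose at compile time: a thread-safe heap.
typedef LockedHeap<MallocHeap> LockedMallocHeap;

Because the composition is resolved at compile time, the compiler can inline straight through the layers; that is how mixins avoid the overhead of a rigid class hierarchy.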

Page 12: Memory Management for High-Performance Applications

Empirical Results

Heap Layers vs. the originals:
KingsleyHeap vs. the BSD allocator
LeaHeap vs. DLmalloc 2.7

Competitive runtime and memory efficiency

[Chart: runtime normalized to the Lea allocator for Kingsley, KingsleyHeap, Lea, and LeaHeap on cfrac, espresso, lindsay, LRUsim, perl, roboop, and their average]

[Chart: space normalized to the Lea allocator, same allocators and benchmarks]

Page 13: Memory Management for High-Performance Applications

Overview

Building memory managers: the Heap Layers framework
Problems with memory managers: contention, space, false sharing
Solution: Hoard, a provably scalable allocator
Extended memory manager for servers: Reap

Page 14: Memory Management for High-Performance Applications

Problems with General-Purpose Memory Managers

Previous work for multiprocessors:

Concurrent single heap [Bigler et al. 85, Johnson 91, Iyengar 92]: impractical

Multiple heaps [Larson 98, Gloger 99]: reduce contention but, as we show, cause other problems:
P-fold or even unbounded increase in space
Allocator-induced false sharing

Page 15: Memory Management for High-Performance Applications

Multiple Heap Allocator: Pure Private Heaps

One heap per processor:
malloc gets memory from its local heap
free puts memory on its local heap

Used by STL, Cilk, and ad hoc allocators

[Animation: processor 0 allocates and frees x1 and x2 on heap 0; processor 1 allocates and frees x3 and x4 on heap 1. Key: in use by processor 0; free, on heap 1]
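A minimal sketch of this policy, assuming a single size class and thread-local heaps (illustrative, not any real allocator's code):

#include <cstdlib>

// One private heap per thread: a free list of blocks of one size class.
struct PrivateHeap {
  struct Block { Block * next; };
  Block * freeList = nullptr;

  void * malloc (size_t sz) {
    if (freeList != nullptr) {       // reuse a locally-freed block
      Block * b = freeList;
      freeList = b->next;
      return b;
    }
    return std::malloc (sz < sizeof(Block) ? sizeof(Block) : sz);
  }

  // Pure private heaps: freed memory goes on the *caller's* heap,
  // regardless of which heap originally supplied it.
  void free (void * ptr) {
    Block * b = static_cast<Block *> (ptr);
    b->next = freeList;
    freeList = b;
  }
};

thread_local PrivateHeap myHeap;     // each thread mallocs and frees here

The next slide's producer-consumer problem follows directly: if thread 0 only mallocs and thread 1 only frees, thread 1's freeList grows without bound while thread 0 keeps asking the system for fresh memory.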

Page 16: Memory Management for High-Performance Applications

Problem: Unbounded Memory Consumption

Producer-consumer:
Processor 0 allocates; processor 1 frees

Unbounded memory consumption: crash!

[Animation: processor 0 runs x1 = malloc(1), x2 = malloc(1), x3 = malloc(1); processor 1 runs free(x1), free(x2), free(x3); the freed memory piles up on heap 1, unusable by processor 0]

Page 17: Memory Management for High-Performance Applications

Multiple Heap Allocator: Private Heaps with Ownership

free returns memory to its original heap

Bounded memory consumption: no crash!

Used by "Ptmalloc" (Linux) and LKmalloc

[Animation: processor 0 runs x1 = malloc(1), free(x1); processor 1 runs x2 = malloc(1), free(x2)]

Page 18: Memory Management for High-Performance Applications

Problem: P-fold Memory Blowup

Occurs in practice: round-robin producer-consumer
processor i mod P allocates
processor (i+1) mod P frees

Each processor's heap ends up holding a full footprint of freed memory, so footprint = 1 (2GB) but space = 3 (6GB) with three processors.
Exceeds the 32-bit address space: crash!

[Animation: processors 0, 1, and 2 each allocate a block (x1, x2, x3); each block is freed by the next processor round-robin, so freed memory accumulates on all three heaps]

Page 19: Memory Management for High-Performance Applications

Problem: Allocator-Induced False Sharing

False sharing: non-shared objects on the same cache line
The bane of parallel applications; extensively studied

All of these allocators cause false sharing!

[Diagram: processor 0 runs x1 = malloc(1) and processor 1 runs x2 = malloc(1); x1 and x2 land on the same cache line, so CPU 0 and CPU 1 thrash the line back and forth across the bus]
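A minimal demonstration of the effect, with assumed object placement (whether two consecutive small allocations share a cache line depends on the allocator, which is exactly the point):

#include <thread>

int main() {
  // Two small "private" objects; a multiple-heap allocator may carve them
  // from adjacent addresses, putting both on one cache line.
  long * x1 = new long (0);   // only thread 0 writes this
  long * x2 = new long (0);   // only thread 1 writes this

  // No object is shared, yet if x1 and x2 share a cache line, each write
  // invalidates the other CPU's copy and the line ping-pongs across the bus.
  std::thread t0 ([&] { for (int i = 0; i < 100000000; ++i) ++*x1; });
  std::thread t1 ([&] { for (int i = 0; i < 100000000; ++i) ++*x2; });
  t0.join();
  t1.join();

  delete x1;
  delete x2;
  return 0;
}

Padding each object out to a full cache line (e.g., with alignas(64)) makes the slowdown disappear, confirming false sharing rather than true contention.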

Page 20: Memory Management for High-Performance Applications

So What Do We Do Now?

Where do we put free memory?
on a central heap: heap contention
on our own heap (pure private heaps): unbounded memory consumption
on the original heap (private heaps with ownership): P-fold blowup

And how do we avoid false sharing?

Page 21: Memory Management for High-Performance Applications

Overview

Building memory managers: the Heap Layers framework
Problems with memory managers: contention, space, false sharing
Solution: Hoard, a provably scalable allocator
Extended memory manager for servers: Reap

Page 22: Memory Management for High-Performance Applications

Hoard: Key Insights

Bound local memory consumption:
explicitly track utilization
move free memory to a global heap
provably bounds memory consumption

Manage memory in large chunks:
avoids false sharing
reduces heap contention

Page 23: Memory Management for High-Performance Applications

Overview of Hoard

Manage memory in page-sized heap blocks: avoids false sharing
Allocate from the local heap block: avoids heap contention
On low utilization, move the heap block to the global heap: avoids space blowup

[Diagram: per-processor heaps (processor 0 through processor P-1) above a shared global heap]
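A minimal sketch of the low-utilization rule, with an assumed emptiness threshold and per-block bookkeeping (Hoard's real invariant tracks whole-heap utilization and uses locked superblock lists):

#include <cstddef>
#include <vector>

// A page-sized heap block: tracks how many of its objects are live.
struct HeapBlock {
  size_t inUse = 0;
  size_t capacity = 0;
  // ... object storage elided ...
};

struct GlobalHeap {
  std::vector<HeapBlock *> blocks;   // lock-protected in a real allocator
  void give (HeapBlock * b) { blocks.push_back (b); }
};

struct PerProcessorHeap {
  GlobalHeap * global;

  // On free, if the block's utilization drops below a threshold, hand the
  // whole block to the global heap so other processors can reuse it.
  // This is what bounds the space blowup.
  void onFree (HeapBlock * b) {
    --b->inUse;
    const double EMPTINESS = 0.25;   // assumed threshold
    if (b->inUse < EMPTINESS * b->capacity) {
      global->give (b);
    }
  }
};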

Page 24: Memory Management for High-Performance Applications

Summary of Analytical Results

Space consumption: near-optimal worst case
Hoard: O(n log M/m + P), with P << n
Optimal: O(n log M/m) [Robson 70]
Private heaps with ownership: O(P n log M/m)

Provably low synchronization

(n = memory required, M = largest object size, m = smallest object size, P = number of processors)

Page 25: Memory Management for High-Performance Applications

Empirical Results

Measured runtime on a 14-processor Sun

Allocators: Solaris (the system allocator), Ptmalloc (GNU libc), mtmalloc (Sun's "MT-hot" allocator)

Micro-benchmarks:
Threadtest: no sharing
Larson: sharing (server-style)
Cache-scratch: mostly reads & writes (tests for false sharing)

Real application experience is similar

Page 26: Memory Management for High-Performance Applications

Runtime Performance: threadtest

Many threads, no sharing

speedup(x, P) = runtime(Solaris allocator on one processor) / runtime(x on P processors)

Hoard achieves linear speedup

Page 27: Memory Management for High-Performance Applications

Runtime Performance: Larson

Many threads, sharing (server-style)

Hoard achieves linear speedup

Page 28: Memory Management for High-Performance Applications

Runtime Performance: false sharing

Many threads, mostly reads & writes of heap data

Hoard achieves linear speedup

Page 29: Memory Management for High-Performance Applications

Hoard in the "Real World"

Open source code: www.hoard.org
13,000 downloads
Solaris, Linux, Windows, IRIX, ...

Widely used in industry:
AOL, British Telecom, Novell, Philips
Reports of 2x-10x, "impressive" improvements in performance
Search servers, telecom billing systems, scene rendering, real-time messaging middleware, text-to-speech engines, telephony, JVMs

A scalable general-purpose memory manager

Page 30: Memory Management for High-Performance Applications

Overview

Building memory managers: the Heap Layers framework
Problems with memory managers: contention, space, false sharing
Solution: Hoard, a provably scalable allocator
Extended memory manager for servers: Reap

Page 31: Memory Management for High-Performance Applications

Custom Memory Allocation

Very common practice: Apache, gcc, lcc, STL, database servers...
Language-level support in C++: replace new/delete, bypassing the general-purpose allocator

Reduces runtime: often
Expands functionality: sometimes
Reduces space: rarely

The standard advice: "Use custom allocators"

Page 32: Memory Management for High-Performance Applications

Runtime - Custom Allocator Benchmarks

[Chart: normalized runtime for Custom, Win32, and DLmalloc allocators on 197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc, apache, lcc, and mudlle, grouped into non-regions, regions, and averages]

The Reality

The Lea allocator is often just as fast or faster
Custom allocation is ineffective, except for regions. [OOPSLA 2002]

Page 33: Memory Management for High-Performance Applications

Overview of Regions

Separate areas, deletion only en masse:
regioncreate(r)
regionmalloc(r, sz)
regiondelete(r)

+ Fast: pointer-bumping allocation, deletion of chunks
+ Convenient: one call frees all memory

- Risky: accidental deletion
- Too much space
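A minimal region allocator sketch using the API names above; the chunk size, alignment, and layout are assumptions:

#include <cstdlib>

// A region: a linked list of large chunks plus a bump pointer.
struct Chunk { Chunk * next; };

struct Region {
  Chunk * chunks = nullptr;   // everything allocated so far
  char * bump = nullptr;      // next free byte in the current chunk
  char * limit = nullptr;     // end of the current chunk
};

const size_t CHUNK_BYTES = 64 * 1024;   // assumed chunk size

void regioncreate (Region & r) { r = Region(); }

void * regionmalloc (Region & r, size_t sz) {
  sz = (sz + 7) & ~size_t(7);           // 8-byte alignment
  if (r.bump == nullptr || r.bump + sz > r.limit) {
    size_t bytes = (sz > CHUNK_BYTES) ? sz : CHUNK_BYTES;
    Chunk * c = (Chunk *) std::malloc (sizeof(Chunk) + bytes);
    c->next = r.chunks;                 // link in the new chunk
    r.chunks = c;
    r.bump = (char *) (c + 1);
    r.limit = r.bump + bytes;
  }
  void * p = r.bump;                    // pointer-bumping allocation
  r.bump += sz;
  return p;
}

// Deletion only en masse: one call frees every chunk in the region.
void regiondelete (Region & r) {
  Chunk * c = r.chunks;
  while (c != nullptr) {
    Chunk * next = c->next;
    std::free (c);
    c = next;
  }
  r = Region();
}

The speed comes from the bump in regionmalloc; the risk and the space cost come from regiondelete being the only way to give anything back.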

Page 34: Memory Management for High-Performance Applications

Why Regions?

Apparently faster and more space-efficient

Servers need memory management support:
avoid resource leaks
tear down memory associated with terminated connections or transactions

Current approach (e.g., Apache): regions

Page 35: Memory Management for High-Performance Applications

Drawbacks of Regions

Can't reclaim memory within regions:
a problem for long-running computations, producer-consumer patterns, and off-the-shelf "malloc/free" programs
leads to unbounded memory consumption

Current situation for Apache:
vulnerable to denial of service
limits the runtime of connections
limits module programming

Page 36: Memory Management for High-Performance Applications

Reap Hybrid Allocator

Reap = region + heap: adds individual object deletion and a heap

reapcreate(r)
reapmalloc(r, sz)
reapfree(r, p)
reapdelete(r)

Can reduce memory consumption
Fast; adapts to its use (region or heap style); cheap deletion
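A minimal sketch of the hybrid, assuming a single size class for brevity (the real Reap segregates sizes and is built from heap layers): region-style bump allocation, plus a free list so reapfree can recycle individual objects:

#include <cstdlib>

struct Reap {
  char * bump = nullptr, * limit = nullptr;
  void * freeList = nullptr;   // individually freed objects, for reuse
  void * chunks = nullptr;     // chunk list, released en masse
};

const size_t OBJ_BYTES = 64, REAP_CHUNK = 64 * 1024;   // assumed sizes

void reapcreate (Reap & r) { r = Reap(); }

void * reapmalloc (Reap & r, size_t sz) {
  (void) sz;                             // single size class assumed
  if (r.freeList != nullptr) {           // heap style: reuse a freed object
    void * p = r.freeList;
    r.freeList = * (void **) p;
    return p;
  }
  if (r.bump == nullptr || r.bump + OBJ_BYTES > r.limit) {
    void * c = std::malloc (sizeof(void *) + REAP_CHUNK);
    * (void **) c = r.chunks;            // link in the new chunk
    r.chunks = c;
    r.bump = (char *) c + sizeof(void *);
    r.limit = r.bump + REAP_CHUNK;
  }
  void * p = r.bump;                     // region style: pointer bump
  r.bump += OBJ_BYTES;
  return p;
}

void reapfree (Reap & r, void * p) {     // individual object deletion
  * (void **) p = r.freeList;
  r.freeList = p;
}

void reapdelete (Reap & r) {             // one call frees all memory
  void * c = r.chunks;
  while (c != nullptr) {
    void * next = * (void **) c;
    std::free (c);
    c = next;
  }
  r = Reap();
}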

Page 37: Memory Management for High-Performance Applications

Using Reap as Regions

[Chart: normalized runtime on the region-based benchmarks lcc and mudlle for Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap; one bar runs off the chart at 4.08]

Reap performance nearly matches regions

Page 38: Memory Management for High-Performance Applications

Reap: Best of Both Worlds

Combining new/delete with regions is usually impossible:
incompatible APIs; hard to rewrite code

Using Reap: incorporated new/delete code into the Apache "mod_bc" module (an arbitrary-precision calculator)
Changed 20 lines (out of 8,000)
Benchmark: compute the 1000th prime
With Reap: 240K. Without Reap: 7.4MB
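A hypothetical sketch of the kind of glue such a port needs (this is not the actual mod_bc patch; the names and the Reap type come from the sketch earlier): point the module's allocation calls at the current connection's reap, so a leak-prone module becomes safe to tear down:

// Hypothetical glue, not the actual mod_bc patch.
static Reap * currentReap;               // assumed to be set per connection

void * bc_malloc (size_t sz) { return reapmalloc (*currentReap, sz); }
void bc_free (void * p) { reapfree (*currentReap, p); }

// At connection teardown, everything the module still holds is reclaimed:
//   reapdelete (*currentReap);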

Page 39: Memory Management for High-Performance Applications

Summary

Building memory managers: the Heap Layers framework [PLDI 2001]
Problems with current memory managers: contention, false sharing, space
Solution: Hoard, a provably scalable memory manager [ASPLOS-IX]
Extended memory manager for servers: Reap [OOPSLA 2002]

Page 40: Memory Management for High-Performance Applications

Current Projects

CRAMM: Cooperative Robust Automatic Memory Management
Garbage collection without paging; automatic heap sizing

SAVMM: Scheduler-Aware Virtual Memory Management

Markov: a programming language for building high-performance servers

COLA: Customizable Object Layout Algorithms
Improving locality in Java

Page 41: Memory Management for High-Performance Applications


www.cs.umass.edu/~plasma


Page 43: Memory Management for High-Performance Applications

Looking Forward

"New" programming languages: increasing use of Java = garbage collection
New architectures: NUMA, SMT/CMP ("hyperthreading")
Technology trends: the memory hierarchy

Page 44: Memory Management for High-Performance Applications

The Ever-Steeper Memory Hierarchy

Higher = smaller, faster, closer to the CPU. A real desktop machine (mine):

registers: 8 integer, 8 floating-point; 1-cycle latency
L1 cache: 8K data & instructions; 2-cycle latency
L2 cache: 512K; 7-cycle latency
RAM: 1GB; 100-cycle latency
Disk: 40GB; 38,000,000-cycle latency (!)

Page 45: Memory Management for High-Performance Applications

Swapping & Throughput

Once the heap exceeds available memory, throughput plummets

Page 46: Memory Management for High-Performance Applications

Why Manage Memory At All?

Just buy more! Simplifies memory management
(still have to collect garbage eventually...)
Workload fits in RAM = no more swapping!

Sounds great...

Page 47: Memory Management for High-Performance Applications

Memory Prices Over Time

[Chart: RAM prices over time in 1977 dollars, from $10,000 down toward $0.01 per GB (log scale), 1977-2005, tracking conventional DRAM generations from 2K to 8M parts]

"Soon it will be free..."

Page 48: Memory Management for High-Performance Applications

Memory Prices: Inflection Point!

[Chart: the same RAM price curve, extended with 512M and 1G parts (SDRAM, RDRAM, DDR, Chipkill) alongside conventional DRAM; the curve shows an inflection point]

Page 49: Memory Management for High-Performance Applications

Memory Is Actually Expensive

Desktops:
most ship with 256MB
1GB = 50% more $$ (laptops: 70%, if even possible)
limited capacity

Servers:
buy 4GB, get 1 CPU free!
Sun Enterprise 10000: 8GB extra = $150,000 (8GB of Sun RAM = 1 Ferrari Modena)

Fast RAM: new technologies
Cosmic rays...

Page 50: Memory Management for High-Performance Applications

Key Problem: Paging

Garbage collectors are VM-oblivious:
GC disrupts the LRU queue, touches non-resident pages

Virtual memory managers are GC-oblivious:
likely to evict pages needed by the GC

Paging takes orders of magnitude more time than RAM:
a BIG hit in performance and LONG pauses

Page 51: Memory Management for High-Performance Applications

Cooperative Robust Automatic Memory Management (CRAMM)

Joint work: Eliot Moss (UMass), Scott Kaplan (Amherst College)

[Diagram: the garbage collector adjusts heap size (coarse-grained, heap-level) and evacuates pages; the virtual memory manager selects victim pages (fine-grained, page-level), tracks per-process and overall memory utilization, and performs page replacement. They exchange: change in memory pressure / new heap size; page eviction notification / victim page(s). "I'm a cooperative application!"]

Page 52: Memory Management for High-Performance Applications

Fine-Grained Cooperative GC

Goal: GC triggers no additional paging

Key ideas:
adapt the collection strategy on the fly
page-oriented memory management
exploit detailed page information from the VM

[Diagram: the garbage collector evacuates pages; the virtual memory manager selects victim pages (fine-grained) for page replacement, sending page eviction notifications and victim page(s)]

Page 53: Memory Management for High-Performance Applications

Summary

Building memory managers: the Heap Layers framework
Problems with memory managers: contention, space, false sharing
Solution: Hoard, a provably scalable allocator
Future directions

Page 54: Memory Management for High-Performance Applications

If You Have to Spend $$...

More memory: bad. More Ferraris: good.

Page 55: Memory Management for High-Performance Applications

www.cs.umass.edu/~emery/plasma

Page 56: Memory Management for High-Performance Applications


This Page Intentionally Left Blank

Page 57: Memory Management for High-Performance Applications

Virtual Memory Manager Support

New VM required: detailed page-level information
"Segmented queue" (an unprotected segment plus a protected segment) for low overhead
Local LRU ordering per process, not global LRU (as in Linux)

Complementary to the SAVM work: the "Scheduler-Aware Virtual Memory manager"
Under development: a modified Linux kernel

Page 58: Memory Management for High-Performance Applications

Current Work: Robust Performance

Currently: no VM-GC communication
BAD interactions under memory pressure

Our approach (with Eliot Moss, Scott Kaplan): Cooperative Robust Automatic Memory Management

[Diagram: the garbage collector/allocator and the virtual memory manager exchange information: memory pressure from the VM's LRU queue, empty pages from the collector; the result is reduced paging impact]

Page 59: Memory Management for High-Performance Applications

Current Work: Predictable VMM

Recent work on scheduling for QoS (e.g., proportional-share)
But under memory pressure, the VMM is the scheduler:
paged-out processes may never recover
intermittent processes may wait a long time

Scheduler-faithful virtual memory (with Scott Kaplan, Prashant Shenoy):
based on page value rather than page order

Page 60: Memory Management for High-Performance Applications

Conclusion

Memory management for high-performance applications:
Heap Layers framework [PLDI 2001]: reusable components, no runtime cost
Hoard scalable memory manager [ASPLOS-IX]: high performance, provably scalable & space-efficient
Reap hybrid memory manager [OOPSLA 2002]: speed & robustness for server applications

Current work: robust memory management for multiprogramming

Page 61: Memory Management for High-Performance Applications


The Obligatory URL Slide

http://www.cs.umass.edu/~emery

Page 62: Memory Management for High-Performance Applications


If You Can Read This,I Went Too Far

Page 63: Memory Management for High-Performance Applications

Hoard: Under the Hood

[Layer diagram: MallocOrFreeHeap on top of SelectSizeHeap, which selects a heap based on size; large objects (> 4K) go to the SystemHeap; small objects go to a PerProcessorHeap (LockedHeap over HeapBlockManager), which mallocs from the local heap and frees to the owning heap block (FreeToHeapBlock); a SuperblockHeap (LockedHeap) and the empty heap blocks (LockedHeap over HeapBlockManager) get memory from or return memory to the global heap]

Page 64: Memory Management for High-Performance Applications

Custom Memory Allocation

Very common practice: Apache, gcc, lcc, STL, database servers...
Language-level support in C++: replace new/delete, bypassing the general-purpose allocator

Reduces runtime: often
Expands functionality: sometimes
Reduces space: rarely

The standard advice: "Use custom allocators"

Page 65: Memory Management for High-Performance Applications

Drawbacks of Custom Allocators

Avoiding the memory manager means:
more code to maintain & debug
can't use memory debuggers
not modular or robust: mixing memory from custom and general-purpose allocators → crash!

An increased burden on programmers

Page 66: Memory Management for High-Performance Applications

Overview

Introduction: perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps: a generalization of regions & heaps

Page 67: Memory Management for High-Performance Applications

(I) Per-Class Allocators

Recycle freed objects of a given class from a free list:

a = new Class1;
b = new Class1;
c = new Class1;
delete a;
delete b;
delete c;
a = new Class1;
b = new Class1;
c = new Class1;

[Diagram: the Class1 free list holds a, b, and c after the deletes]

+ Fast: linked-list operations
+ Simple: identical semantics, C++ language support
- Possibly space-inefficient
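A minimal sketch of the technique, assuming single-threaded use; the class-level operator new/delete overloads are the C++ language support the slide refers to:

#include <cstdlib>

class Class1 {
public:
  // Per-class allocation: recycle freed instances from a free list.
  static void * operator new (size_t sz) {
    if (freeList != nullptr) {         // pop a recycled object
      void * p = freeList;
      freeList = * (void **) p;
      return p;
    }
    return std::malloc (sz);
  }
  static void operator delete (void * p) {
    * (void **) p = freeList;          // push onto the free list
    freeList = p;
  }
private:
  static void * freeList;
  char payload[32];                    // illustrative object state
};

void * Class1::freeList = nullptr;

The space cost is visible here: freed Class1 objects sit on this free list forever, unavailable to any other class or size.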

Page 68: Memory Management for High-Performance Applications

(II) Custom Patterns

Tailor-made to fit the allocation pattern
Example: 197.parser (a natural language parser) allocates from a fixed char[MEMORY_LIMIT] array:

a = xalloc(8);
b = xalloc(16);
c = xalloc(8);
xfree(b);
xfree(c);
d = xalloc(8);

[Diagram: a, b, and c allocated by bumping end_of_array; frees roll it back; d reuses the space]

+ Fast: pointer-bumping allocation
- Brittle: fixed memory size, requires stack-like lifetimes
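A minimal reconstruction of the pattern under the stated constraints (fixed array, stack-like lifetimes); the xalloc/xfree bodies are assumptions, not 197.parser's actual code:

#include <cstddef>

const size_t MEMORY_LIMIT = 1 << 20;   // assumed limit
static char memory[MEMORY_LIMIT];
static size_t end_of_array = 0;        // the bump pointer

// Allocation just bumps end_of_array: fast, but no overflow check (brittle).
void * xalloc (size_t sz) {
  void * p = &memory[end_of_array];
  end_of_array += sz;
  return p;
}

// Freeing rolls the pointer back to the freed object, releasing it and
// everything allocated after it: hence the stack-like lifetime requirement.
void xfree (void * p) {
  end_of_array = (char *) p - memory;
}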

Page 69: Memory Management for High-Performance Applications

(III) Regions

Separate areas, deletion only en masse:
regioncreate(r)
regionmalloc(r, sz)
regiondelete(r)

+ Fast: pointer-bumping allocation, deletion of chunks
+ Convenient: one call frees all memory

- Risky: accidental deletion
- Too much space

Page 70: Memory Management for High-Performance Applications

Overview

Introduction: perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps: a generalization of regions & heaps

Page 71: Memory Management for High-Performance Applications

Custom Allocators Are Faster…

[Chart: normalized runtime for Custom and Win32 allocators on 197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc, apache, lcc, and mudlle, grouped into non-regions, regions, and averages; the custom allocators appear faster]

Page 72: Memory Management for High-Performance Applications

Not So Fast…

[Chart: the same benchmarks with DLmalloc added; DLmalloc matches or beats the custom allocators except on the region-based benchmarks]

Page 73: Memory Management for High-Performance Applications

The Lea Allocator (DLmalloc 2.7.0)

Optimized for common allocation patterns:
per-size quicklists ≈ per-class allocation
deferred coalescing (combining adjacent free objects)
highly-optimized fastpath
space-efficient

Page 74: Memory Management for High-Performance Applications

Space Consumption Results

[Chart: space on the custom allocator benchmarks, normalized; bars for Original and DLmalloc, grouped into non-regions, regions, and averages]

Page 75: Memory Management for High-Performance Applications

Overview

Introduction: perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps: a generalization of regions & heaps

Page 76: Memory Management for High-Performance Applications

Why Regions?

Apparently faster and more space-efficient

Servers need memory management support:
avoid resource leaks
tear down memory associated with terminated connections or transactions

Current approach (e.g., Apache): regions

Page 77: Memory Management for High-Performance Applications

Drawbacks of Regions

Can't reclaim memory within regions:
a problem for long-running computations, producer-consumer patterns, and off-the-shelf "malloc/free" programs
leads to unbounded memory consumption

Current situation for Apache:
vulnerable to denial of service
limits the runtime of connections
limits module programming

Page 78: Memory Management for High-Performance Applications

Reap Hybrid Allocator

Reap = region + heap: adds individual object deletion and a heap

reapcreate(r)
reapmalloc(r, sz)
reapfree(r, p)
reapdelete(r)

Can reduce memory consumption
+ Fast
+ Adapts to its use (region or heap style)
+ Cheap deletion

Page 79: Memory Management for High-Performance Applications

Using Reap as Regions

[Chart: normalized runtime on the region-based benchmarks lcc and mudlle for Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap; one bar runs off the chart at 4.08]

Reap performance nearly matches regions

Page 80: Memory Management for High-Performance Applications

Reap: Best of Both Worlds

Combining new/delete with regions is usually impossible:
incompatible APIs; hard to rewrite code

Using Reap: incorporated new/delete code into the Apache "mod_bc" module (an arbitrary-precision calculator)
Changed 20 lines (out of 8,000)
Benchmark: compute the 1000th prime
With Reap: 240K. Without Reap: 7.4MB

Page 81: Memory Management for High-Performance Applications

Conclusion

Empirical study of custom allocators:
the Lea allocator is often as fast or faster
custom allocation is ineffective, except for regions

Reaps nearly match region performance without the other drawbacks

Take-home message: stop using custom memory allocators!

Page 82: Memory Management for High-Performance Applications


Software

http://www.cs.umass.edu/~emery

(part of Heap Layers distribution)

Page 83: Memory Management for High-Performance Applications

Experimental Methodology

Comparing to general-purpose allocators:

Same semantics: no problem (e.g., disable per-class allocators)

Different semantics: use an emulator
Uses the general-purpose allocator but adds bookkeeping
regionfree: frees all associated objects
Other functionality (nesting, obstacks)
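A minimal sketch of such an emulator (names are illustrative): it satisfies region allocation with the general-purpose allocator but records every object so that a region-wide free works:

#include <cstdlib>
#include <vector>

// Region emulator: same semantics as a region, but each object comes from
// the general-purpose allocator, plus bookkeeping.
struct EmulatedRegion {
  std::vector<void *> objects;   // the added bookkeeping
};

void * emuRegionMalloc (EmulatedRegion & r, size_t sz) {
  void * p = std::malloc (sz);   // general-purpose allocation
  r.objects.push_back (p);
  return p;
}

// regionfree semantics: free all associated objects.
void emuRegionFree (EmulatedRegion & r) {
  for (void * p : r.objects) {
    std::free (p);
  }
  r.objects.clear();
}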

Page 84: Memory Management for High-Performance Applications

Use Custom Allocators?

Strongly recommended by practitioners

But little hard data on performance/space improvements:
only one previous study [Zorn 1992], focused on just one type of allocator
its conclusion: custom allocators are a waste of time (small gains, bad allocators)
Are different allocators better? What are the trade-offs?

Page 85: Memory Management for High-Performance Applications

Kinds of Custom Allocators

Three basic types of custom allocators:
Per-class: fast
Custom patterns: fast, but very special-purpose
Regions: fast, possibly more space-efficient, convenient
Variants: nested regions, obstacks

Page 86: Memory Management for High-Performance Applications

Optimization Opportunity

[Chart: percentage of runtime spent in memory operations vs. other work, per benchmark, on a 0-100% scale]


Page 88: Memory Management for High-Performance Applications

Custom Memory Allocation

Programmers often replace malloc/free:
to attempt to increase performance
to provide extra functionality (e.g., for servers)
to reduce space (rarely)

Empirical study of custom allocators:
the Lea allocator is often as fast or faster
custom allocation is ineffective, except for regions. [OOPSLA 2002]

Page 89: Memory Management for High-Performance Applications

Overview of Regions

Separate areas, deletion only en masse:
regioncreate(r)
regionmalloc(r, sz)
regiondelete(r)

+ Fast: pointer-bumping allocation, deletion of chunks
+ Convenient: one call frees all memory

- Risky: accidental deletion
- Too much space

Page 90: Memory Management for High-Performance Applications

Why Regions?

Apparently faster and more space-efficient

Servers need memory management support:
avoid resource leaks
tear down memory associated with terminated connections or transactions

Current approach (e.g., Apache): regions

Page 91: Memory Management for High-Performance Applications

Drawbacks of Regions

Can't reclaim memory within regions:
a problem for long-running computations, producer-consumer patterns, and off-the-shelf "malloc/free" programs
leads to unbounded memory consumption

Current situation for Apache:
vulnerable to denial of service
limits the runtime of connections
limits module programming

Page 92: Memory Management for High-Performance Applications

Reap Hybrid Allocator

Reap = region + heap: adds individual object deletion and a heap

reapcreate(r)
reapmalloc(r, sz)
reapfree(r, p)
reapdelete(r)

Can reduce memory consumption
Fast; adapts to its use (region or heap style); cheap deletion

Page 93: Memory Management for High-Performance Applications

Using Reap as Regions

[Chart: normalized runtime on the region-based benchmarks lcc and mudlle for Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap; one bar runs off the chart at 4.08]

Reap performance nearly matches regions

Page 94: Memory Management for High-Performance Applications

Reap: Best of Both Worlds

Combining new/delete with regions is usually impossible:
incompatible APIs; hard to rewrite code

Using Reap: incorporated new/delete code into the Apache "mod_bc" module (an arbitrary-precision calculator)
Changed 20 lines (out of 8,000)
Benchmark: compute the 1000th prime
With Reap: 240K. Without Reap: 7.4MB