Memory unmanglement

Memory Unmanglement With Perl

How to do what you do without getting hit in the memory.

Steven Lembark
Workhorse Computing

In Our Last Episode...

● We saw our hero battling the forces of RAM bloat in long-running, heavily-forked, or large-scale processes.

● Learned the golden rule: Nothing Shrinks.
● Observed memory benchmarks using Devel::Peek, Devel::Size, and perl -d.
  ● peek() shows the structure & hash efficiency.
  ● size() & total_size() show memory usage.

Time vs. Space

● The classic tradeoff is handled in favor of time in the perl implementations.

● More efficient data structures can help both sides.
  ● Avoiding wasted space can help avoid thrashing, heap management, and system call overhead.
  ● Arrays can be both more compact and faster to access than hashes in some situations.

● Benchmarks are not only for time: include checks of size(), total_size(), and peek() to see what is really going on.

Nothing Ever Shrinks

● perl maintains strings and arrays as pointers to memory allocations.
  ● Adjusting the size of a scalar with substr or a regex changes its start and length.
  ● shift and pop adjust an array's initial offset and count.

● None of these will reduce the memory overhead of the 'scaffolding' perl uses to manage the data.

Look Deep Into Your Memory

● Devel::Peek
  ● peek() at the structure (the module exports this as Dump()).
  ● Shows efficiency of hashing.
● Devel::Size
  ● size() shows memory usage of “scaffolding”.
  ● total_size() includes contents along with skeleton.

● size() can be useful in loops for managing size of recycled buffers.

Size & Structure

● Scalars
  ● Reference allocations for strings with offset & length.
  ● size() of the scalar is small, total_size() can be large.
● Arrays
  ● Allocated list of Scalars, also with offset & length.
  ● size() reports space for list, total_size() includes contents.
● Hashes
  ● Hash chains are an array of arrays with min. 8 chains.
  ● size() reports space for hash chains.
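
A minimal sketch of the size() vs. total_size() difference on a nested structure. It assumes Devel::Size is installed from CPAN (it is not a core module), so the code checks for it and skips the measurement otherwise. The @rows data is made up for illustration, and the numbers printed will vary by perl version and platform.

```perl
use strict;
use warnings;

# Devel::Size comes from CPAN, not core: load it only if available.
my $have_size = eval { require Devel::Size; 1 };

# hypothetical nested data: 1000 rows of [ id, 100-char string ].
my @rows = map { [ $_, 'x' x 100 ] } 1 .. 1_000;

my ( $skeleton, $total ) = ( 0, 0 );

if( $have_size )
{
    # size() measures the array itself, not what it refers to.
    $skeleton = Devel::Size::size( \@rows );

    # total_size() follows the references and adds the contents.
    $total    = Devel::Size::total_size( \@rows );

    print "size:       $skeleton\n";
    print "total_size: $total\n";
}
else
{
    print "Devel::Size not installed; skipping\n";
}
```

On a nested structure like this one total_size() should come out well above size(), since the row arrays and their strings are only counted by the former.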

Taming the Beast

● There are tools for managing the memory, most of which involve some sort of time/space tradeoff.
  ● undef can help, though probably less than you think.
  ● You can manage the lifetime of variables with lexical or local values.
  ● Recycling buffers localizes the bloat to one structure.
  ● Adapting your code to use more effective data structures offers the best solution for large data.

● Here are some ideas.

undef() is somewhat helpful

● Marks the variable for reclamation.
  ● Space may not be reclaimed immediately: it is up to perl whether to add heap or recycle the undef'ed variables.

● Structures are discarded, not reduced.
  ● This can have a significant performance overhead on nested, reused data structures.

● Tradeoff: the space saved is paid for with the time spent rebuilding the skeleton of discarded structures.

● Most useful for recycling single-level structures.

undefing an Array Doesn't Zero It

● For a large, nested structure this may not save the amount of space you expect.

    my @a = ();
    $#a = 999_999;
    print "Full \@a:\t", size( \@a ), "\n";

    undef @a;
    print "Post \@a:\t", size( \@a ), "\n";

● The contents are discarded & reallocated:

    Full @a: 4000200
    Post @a: 100

Recycling Buffers

● Use size() to discard and reallocate the buffer if it grows too large.

● Preallocate to avoid the margin-of-error added by perl when the initial allocation grows.

● Decent tradeoff between reallocating a buffer frequently and having it grow without bounds.

● Avoids one record botching the entire processing cycle.

Scalar Buffer

● Recycle buffer, clean it up, then copy by value.
● Easiest with scalars since they don't have any nested structure.

    while( $buffer = get_data() )
    {
        $buffer =~ s/^\s+//;
        ...
        push @data, $buffer;

        # discard and preallocate once the buffer has grown too large.
        if( size( $buffer ) > $max_buff )
        {
            undef $buffer;
            $buffer = ' ' x $max_buff;
        }
    }

Array Buffer

● This works well for single-level buffers; multi-level buffers often require too much work to rebuild.

    my @buff = ();
    $#buff = $buff_count;

    while( @buff = get_data() )
    {
        ... # clean up buffer
        $data{ $key } = [ @buff ]; # store values

        if( size( \@buff ) > $buff_max )
        {
            undef @buff;
            $#buff = $buff_count; # preallocate again
        }
    }

Assign Arrays Single-Pass

● Say you have to store a large number of items:

    my @a = ();
    my @b = ();

    push @a, "" for( 1 .. 1_000_000 );
    @b = map { "" } ( 1 .. 1_000_000 );

    print 'Size of @a: ', size( \@a ), "\n";
    print 'Size of @b: ', size( \@b ), "\n";

● Push ends up with a larger structure:

    Size of @a: 4194388
    Size of @b: 4000100

Hashes are Huge

● Incremental assignment doesn't make hashes larger: they are 8x larger than arrays in both cases.

    my %a = ();
    my %b = ();

    $a{ $_ } = "" for ( 1 .. 1_000_000 );
    %b = map { $_ => "" } ( 1 .. 1_000_000 );

    print 'Size of %a: ', size( \%a ), "\n";
    print 'Size of %b: ', size( \%b ), "\n";

    Size of %a: 32083244 # vs. 4000100
    Size of %b: 32083244 # in an array!

Two Ways of Storing Nothing

● There are two common ways of storing nothing in the values of a hash:
  ● Assign an empty list: $hash{ $key } = ();
  ● Assign an empty string: $hash{ $key } = "";

● Question:

Which would take less space: empty list or empty string?

TMTOWTDN

    my %a = ();
    my %b = ();

    $a{ $_ } = () for( 'aaa' .. 'zzz' );
    $b{ $_ } = '' for( 'aaa' .. 'zzz' );

    print "Size of %a:\t", size( \%a ), "\n";
    print "Size of %b:\t", size( \%b ), "\n";

    Size of %a: 570516 # same size for "" & ()?
    Size of %b: 570516

● size() gives the same result for both values. Why?

TMTOWTDN

    my %a = ();
    my %b = ();

    $a{ $_ } = () for( 'aaa' .. 'zzz' );
    $b{ $_ } = '' for( 'aaa' .. 'zzz' );

    print "Size of %a:\t", size( \%a ), "\n";
    print "Size of %b:\t", size( \%b ), "\n";

    print "Total in %a:\t", total_size( \%a ), "\n";
    print "Total in %b:\t", total_size( \%b ), "\n";

    Size of %a: 570516 # size() doesn't always
    Size of %b: 570516 # matter!

● total_size() benchmarks the values:

    Total in %a: 851732
    Total in %b: 1203252

Replace Hashes With Arrays

● The smartmatch operator (“~~”) is fast.
● Pushing onto an array:

    $a ~~ @uniq or push @uniq, $a

  uses about 1/8 the space of assigning hash keys:

    $uniq{ $a } = ();
    ...
    keys %uniq

● The extra space used by array growth in push is dwarfed by the savings of an array over a hash.

● sort @uniq is much faster than sort keys %uniq.
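
A minimal sketch of the array-based unique list. Smartmatch has been flagged experimental since perl 5.18 and deprecated since 5.38, so this version swaps in any() from the core List::Util module as a stand-in; that substitution is mine, not the slides' method. The @input values are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw( any );

# hypothetical input with repeats.
my @input = qw( cat dog cat bird dog cat );

my @uniq = ();

for my $item ( @input )
{
    # scan what is stored so far; push only values not seen yet.
    any { $_ eq $item } @uniq
    or push @uniq, $item;
}

# the array sorts directly, with no keys() call needed.
my @sorted = sort @uniq;

print "@sorted\n";
```

Each insert scans the array, so this spends time to save space: it suits lists that stay small relative to the number of lookups.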

Example: Taxonomy Trees

● The NCBI Taxonomy is delivered with each entry having a full tree.

● These must be reduced to a single tree for data entry and validation.

● There are several ways to do this...

Worst Solution: Parent tree.

● Since the tree is often used from the bottom up, some people store it as a child:parent relationship:

$parentz{ $child_id } = $parent_id;

● Unfortunately, this allocates a full hash table for each 1:1 relationship between a child and parent.

Another Bad Solution: Child Tree

● Another alternative is storing the children in a hash for each parent:

$childz{ $parent_id }{ $child_id } = ();

$childz{ '' } = [ $root_id ];

● This works via depth-first search to generate the trees and has space to store the tree depth.

● Hashes are bulky and slow for storing a single-level structure like this.

Another Solution: Single-Level Hash

● One oft-forgotten bit of Perly lore in the age of references: multi-part hash keys.

$childz{ $parent_id, $child_id } = $depth;

$childz{ "" } = [ $root_id ];

● Trades wasted space in thousands of anon hashes for split /$;/o, $key and greps.

● Usable for moderate trees.
● Obviously painful for really large trees.
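
A rough sketch of what the split-and-grep side of that trade looks like. The rows and the children_of() helper are hypothetical names for illustration; the only perl behavior relied on is that a multi-part subscript is joined with $; into a single key.

```perl
use strict;
use warnings;

my %childz = ();

# hypothetical [ parent, child, depth ] rows.
for my $row ( [ 1, 2, 1 ], [ 1, 3, 1 ], [ 2, 4, 2 ] )
{
    my ( $parent_id, $child_id, $depth ) = @$row;

    # perl joins the two subscripts with $; into one flat key.
    $childz{ $parent_id, $child_id } = $depth;
}

# the separator may be a regex metacharacter, so quote it once.
my $sep = quotemeta( $; );

# finding one parent's children means splitting and scanning every key.
sub children_of
{
    my $parent_id = shift;

    return
    sort { $a <=> $b }
    map  { $_->[1] }
    grep { $_->[0] == $parent_id }
    map  { [ split /$sep/, $_ ] }
    keys %childz;
}

my @kids = children_of( 1 );

print "Children of 1: @kids\n";
```

The full scan of keys on every lookup is why this becomes painful as the tree grows.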

Q: Why Nest Hashes?

● Hashes are nice for the top-level lookup, but why nest them?

● Arrays save about 85% of the overhead below the top level.

● Any wasted space from the arrays growing via push is more than saved by avoiding hashes.

● The arrays only need to be sorted once if the tree is used multiple times.

    my $c = $childz{ $parent_id } ||= [];

    $new_id ~~ $c
    or push @$c, $new_id;

Nested Lists

● List::Util has first() which saves grepping entire lists.
● A key and payload on an array can be handled quickly.

    first { $_->[0] eq $key } @data;

● For shorter lists this saves space and can be faster than a hash.

● This is best for numerics, which don't have to be converted to text in order to be hashed: $_->[0] == $value is the least amount of work to compare integers.
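
A small sketch of the key-and-payload lookup with a numeric key. The @data rows and $value are made up for illustration; first() is the real List::Util export.

```perl
use strict;
use warnings;
use List::Util qw( first );

# each row is [ key, payload ]; hypothetical data.
my @data =
(
    [ 15, 'acgt' ],
    [ 42, 'ttga' ],
    [ 99, 'ccat' ],
);

my $value = 42;

# first() stops at the first match; grep would walk the whole list.
my $found = first { $_->[0] == $value } @data;

print $found ? "Found: $found->[1]\n" : "Not found\n";
```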

Manage Lifespans

● Lexical variables are an obvious place.
● Local values are another.

● Saves reallocating a set of values within tight loops in the called code.

● Local hash keys are a good way to manage storage in reused hashes handled with recursion.

● Use delete to remove hash keys in multi-level structures instead of assigning an empty list or "".
  ● This preserves the skeleton for recycling.
  ● Saves storing the keys.
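
One way the local-hash-key idea might look in a recursive tree walk. The %childz layout, %on_path, and walk() are hypothetical names for illustration; the behavior relied on is that a localized hash element is restored, or removed if it did not exist before, when the enclosing scope exits.

```perl
use strict;
use warnings;

# hypothetical child lists keyed by parent id.
my %childz = ( 1 => [ 2, 3 ], 2 => [ 4 ], 3 => [], 4 => [] );

# one hash reused by every level of the recursion.
my %on_path     = ();
my @leaf_paths  = ();

sub walk
{
    my ( $id, $depth ) = @_;

    # the key lives only until this call returns.
    local $on_path{ $id } = $depth;

    my $kids = $childz{ $id };

    # at a leaf, record which ids are on the current path.
    @$kids
    or push @leaf_paths, join ',', sort { $a <=> $b } keys %on_path;

    walk( $_, $depth + 1 ) for @$kids;

    return
}

walk( 1, 0 );

print "$_\n" for @leaf_paths;
print 'Keys left after the walk: ', scalar( keys %on_path ), "\n";
```

No explicit cleanup code is needed: the hash is empty again once the outermost call returns, and its skeleton is still there for the next walk.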

Use Simpler Objects

● If you're using inside-out objects, why bless a hash?
  ● Users aren't supposed to diddle around inside your objects anyway.
  ● The only thing you care about is the address.
● Bless something smaller:

my $obj = bless \(my $a), $package;
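
A minimal inside-out class built on a blessed scalar, as a sketch only. The Counter package and its methods are hypothetical; refaddr() comes from the core Scalar::Util module and supplies the address used as the key.

```perl
package Counter;

use strict;
use warnings;
use Scalar::Util qw( refaddr );

# inside-out storage: the data lives here, keyed by object address.
my %count = ();

sub new
{
    my $package = shift;

    # the object itself is just a blessed reference to an empty scalar.
    my $obj = bless \( my $anon ), $package;

    $count{ refaddr $obj } = 0;

    $obj
}

sub increment { ++$count{ refaddr $_[0] } }
sub count     {   $count{ refaddr $_[0] } }

# drop the stored data when the object goes away.
sub DESTROY   { delete $count{ refaddr $_[0] } }

package main;

my $obj = Counter->new;

$obj->increment for 1 .. 3;

print 'Count: ', $obj->count, "\n";
```

The per-object cost is one scalar plus one hash entry per attribute, rather than a whole hash per object.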

Use Linked Lists for Queues

● Automatically frees discarded nodes without having to modify the entire list.

● Based on an array they don't use much extra data:

    $node = [ $ref_to_next, @node_data ];

● Walking the list is simple enough:

    ( $node, my @data ) = @$node;

● So is removing a node:

    $node->[0] = $node->[0][0];

● These are quite convenient for threading.
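
Putting the three snippets together, a short sketch of building, trimming, and walking such a list. The job names are made up for illustration.

```perl
use strict;
use warnings;

# build the list back to front: each node is [ next, data... ].
my $head = undef;

for my $job ( reverse qw( fetch parse store ) )
{
    $head = [ $head, $job ];
}

# remove the node after the head by pointing past it; the skipped
# node is freed once nothing else refers to it.
$head->[0] = $head->[0][0];

# walk the list: each step replaces $node with its successor.
my @seen = ();
my $node = $head;

while( $node )
{
    ( $node, my @data ) = @$node;

    push @seen, @data;
}

print "@seen\n";
```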

Use Hashes for Sparse Arrays

● OK, time to stop beating up on hashes.
● They beat out arrays for sparse lists.
● Even lists of integers.

● Say a collection of DNA runs from 15 to 10_000 bases, filling about 10% of the actual values.

● You could store it as:

    $dnaz[ $length ] = [ qw( dna dna dna ) ];

● But this is probably better stored in a hash:

    $dnaz{ $length } = [ qw( dna dna dna ) ];

Accessing Hash Keys: Integer Slices

● Numeric sequences work fine as hash keys.
● Say you want to find all of the sequences within +/- 10% of the current length:

    my $min = 0.9 * $length;
    my $max = 1.1 * $length;
    my @found = grep { $_ } @dnaz{ ( $min .. $max ) };

● For nontrivial, sparse lists this saves scaffolding by only storing the structure necessary.

● This doesn't change the data storage, just the overhead for accessing it by length.
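
A runnable sketch of the slice lookup on a sparse hash. The lengths and sequence names are invented for illustration, and the explicit int() is my addition to make the integer bounds of the range visible.

```perl
use strict;
use warnings;

# hypothetical sparse data: sequence names keyed by length.
my %dnaz =
(
     95 => [ qw( seq_a ) ],
    100 => [ qw( seq_b seq_c ) ],
    108 => [ qw( seq_d ) ],
    250 => [ qw( seq_e ) ],
);

my $length = 100;

# the range operator works on integers, so truncate the bounds.
my $min = int( 0.9 * $length );
my $max = int( 1.1 * $length );

# the slice yields undef for lengths with no entry; grep drops them.
my @found = grep { $_ } @dnaz{ $min .. $max };

# flatten the matching lists, in order of length.
my @names = map { @$_ } @found;

print "@names\n";
```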

Store Upper-Triangular Comparisons

● Saves more than half the space.
● Accessor can look for $i > $j ? [$i][$j] : [$j][$i] and get the same results.
● Requires designing symmetric comparison algorithms (values can be returned as-is or just negated).

● Also saves about half the processing time to only generate a single comparison for each pair.

● Requires access to the algorithm.
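
A sketch of the triangular table and its accessor, under stated assumptions: compare() is a made-up symmetric stand-in for the real comparison, @seqz is invented data, and treating a self-comparison as 0 is my assumption rather than something the slides specify.

```perl
use strict;
use warnings;

# hypothetical symmetric comparison, for illustration only.
sub compare { abs( length( $_[0] ) - length( $_[1] ) ) }

my @seqz = qw( acgt ac acgtac a );

# store only the cells with $i > $j: one comparison per pair.
my @table = ();

for my $i ( 1 .. $#seqz )
{
    for my $j ( 0 .. $i - 1 )
    {
        $table[ $i ][ $j ] = compare( @seqz[ $i, $j ] );
    }
}

# put the larger index first so both orders reach the same cell.
sub lookup
{
    my ( $i, $j ) = @_;

    # assumption: comparing a sequence to itself yields 0.
    $i == $j and return 0;

    $i > $j ? $table[ $i ][ $j ] : $table[ $j ][ $i ]
}

print 'lookup( 0, 2 ) = ', lookup( 0, 2 ), "\n";
print 'lookup( 2, 0 ) = ', lookup( 2, 0 ), "\n";
```

For an antisymmetric comparison the accessor would negate the stored value when the indexes are swapped instead of returning it as-is.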

Example: DNA Analysis

● Our W-curve analysis is used to compare large groups of DNA to one another.

● The original algorithm compared the curves until the first one was exhausted.

● Changing that to use the longer sequence in all cases saved us over half the comparison time.

Summary

● Devel::Size can be useful in your code.
● Managing the lifespan of values helps.
● Using efficient structures helps even more.

● Use arrays instead of hash structures where they make sense.

● Bless smaller structures: scalars, regexen, globs make perfectly good objects and take less space than hashes.

● Use XS or Inline where necessary.
● And, yes, size() still matters.
