Memory Un-manglement With Perl
How to do what you do without getting hit in the memory.
Steven Lembark, Workhorse Computing
In Our Last Episode...
● We saw our hero battling the forces of RAM bloat in long-running, heavily forked, or large-scale processes.
● Learned the golden rule: Nothing Shrinks.
● Observed memory benchmarks using Devel::Peek, Devel::Size, and perl -d.
● peek() shows the structure & hash efficiency.
● size() & total_size() show memory usage.
Time vs. Space
● The classic tradeoff is handled in favor of time in the perl implementations.
● More efficient data structures can help both sides.
● Avoiding wasted space can help avoid thrashing, heap management, and system call overhead.
● Faster access for arrays can make them more compact and faster than hashes in some situations.
● Benchmarks are not only for time: include checks of size(), total_size(), and peek() to see what is really going on.
Nothing Ever Shrinks
● perl maintains strings and arrays as pointers to memory allocations.
● Adjusting the size of a scalar with substr or a regex changes its start and length.
● shift and pop adjust an array's initial offset and count.
● None of these will reduce the memory overhead of the 'scaffolding' perl uses to manage the data.
Look Deep Into Your Memory
● Devel::Peek
● peek() at the structure.
● Shows efficiency of hashing.
● Devel::Size
● size() shows memory usage of “scaffolding”.
● total_size() includes contents along with skeleton.
● size() can be useful in loops for managing size of recycled buffers.
Size & Structure
● Scalars
● Reference allocations for strings with offset & length.
● size() of the scalar is small, total_size() can be large.
● Arrays
● Allocated list of Scalars, also with offset & length.
● size() reports space for list, total_size() includes contents.
● Hashes
● Hash chains are an array of arrays with min. 8 chains.
● size() reports space for hash chains.
Taming the Beast
● There are tools for managing the memory, most of which involve some sort of time/space tradeoff.
● undef can help, though probably less than you think.
● You can manage the lifetime of variables with lexical or local values.
● Recycling buffers localizes the bloat to one structure.
● Adapting your code to use more effective data structures offers the best solution for large data.
● Here are some ideas.
undef() is somewhat helpful
● Marks the variable for reclamation.
● Space may not be immediately reclaimed: it is up to perl whether to add heap or recycle the undefed variables.
● Structures are discarded, not reduced.
● This can have a significant performance overhead on nested, reused data structures.
● Tradeoff: space for time for rebuilding the skeleton of discarded structures.
● Most useful for recycling single-level structures.
undefing an Array Doesn't Zero It
● For a large, nested structure this may not save the amount of space you expect.
use Devel::Size qw( size );
my @a = ();
$#a = 999_999;
print "Full \@a:\t", size( \@a ), "\n";
undef @a;
print "Post \@a:\t", size( \@a ), "\n";
● The contents are discarded & reallocated:
Full @a: 4000200
Post @a: 100
Recycling Buffers
● Use size() to discard and reallocate the buffer if it grows too large.
● Preallocate to avoid the margin of error added by perl when the initial allocation grows.
● Decent tradeoff between reallocating a buffer frequently and having it grow without bounds.
● Avoids one record botching the entire processing cycle.
Scalar Buffer
● Recycle buffer, clean it up, then copy by value.
● Easiest with scalars since they don't have any nested structure.
while( $buffer = get_data() )
{
    $buffer =~ s/^\s+//;
    ...
    push @data, $buffer;    # copy by value

    # Discard and preallocate once the buffer has grown too large.
    if( size( $buffer ) > $max_buff )
    {
        undef $buffer;
        $buffer = ' ' x $max_buff;
    }
}
Array Buffer
● This works well for single-level buffers; multilevel buffers often require too much work to rebuild.
my @buff = ();
$#buff = $buff_count;
while( @buff = get_data() )
{
    ...                         # clean up buffer
    $data{ $key } = [ @buff ];  # store values

    # Check the buffer itself, then preallocate it again.
    if( size( \@buff ) > $buff_max )
    {
        undef @buff;
        $#buff = $buff_count;
    }
}
Assign Arrays Single-Pass
● Say you have to store a large number of items:
my @a = ();
my @b = ();
push @a, "" for( 1 .. 1_000_000 );
@b = map { "" } ( 1 .. 1_000_000 );
print 'Size of @a: ', size( \@a ), "\n";
print 'Size of @b: ', size( \@b ), "\n";
● Push ends up with a larger structure:
Size of @a: 4194388
Size of @b: 4000100
Hashes are Huge
● Incremental assignment doesn't make hashes larger: they are 8x larger than arrays in both cases.
my %a = ();
my %b = ();
$a{ $_ } = "" for ( 1 .. 1_000_000 );
%b = map { $_ => "" } ( 1 .. 1_000_000 );
print 'Size of %a: ', size( \%a ), "\n";
print 'Size of %b: ', size( \%b ), "\n";
Size of %a: 32083244 # vs. 4000100
Size of %b: 32083244 # in an array!
Two Ways of Storing Nothing
● There are two common ways of storing nothing in the values of a hash:
● Assign an empty list: $hash{ $key } = ();
● Assign an empty string: $hash{ $key } = "";
● Question:
Which would take less space: empty list or empty string?
TMTOWTDN
my %a = ();
my %b = ();
$a{ $_ } = () for( 'aaa' .. 'zzz' );
$b{ $_ } = '' for( 'aaa' .. 'zzz' );
print "Size of %a:\t", size( \%a ), "\n";
print "Size of %b:\t", size( \%b ), "\n";
Size of %a: 570516 # same size for "" & ()?
Size of %b: 570516
● size() gives the same result for both values. Why?
TMTOWTDN
my %a = ();
my %b = ();
$a{ $_ } = () for( 'aaa' .. 'zzz' );
$b{ $_ } = '' for( 'aaa' .. 'zzz' );
print "Size of %a:\t", size( \%a ), "\n";
print "Size of %b:\t", size( \%b ), "\n";
print "Total in %a:\t", total_size( \%a ), "\n";
print "Total in %b:\t", total_size( \%b ), "\n";
Size of %a: 570516 # size() doesn't always
Size of %b: 570516 # matter!
● total_size() benchmarks the values:
Total in %a: 851732
Total in %b: 1203252
Replace Hashes With Arrays
● The smartmatch operator ("~~") is fast.
● Pushing onto an array:
$a ~~ @uniq or push @uniq, $a
uses about 1/8 the space of assigning hash keys:
$uniq{ $a } = ();
...
keys %uniq
● The extra space used by array growth in push is dwarfed by the savings of an array over a hash.
● sort @uniq is much faster than sort keys %uniq.
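The array-based uniqueness check can be sketched without smartmatch, which is deprecated in recent perls. This version assumes any() from List::Util as the membership test; @input and its values are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw( any );

# Hypothetical input with duplicates.
my @input = qw( gattaca catg gattaca tgca catg );

my @uniq = ();

for my $item ( @input )
{
    # A linear scan of the array replaces the hash-key lookup.
    any { $_ eq $item } @uniq
    or push @uniq, $item;
}

# The array is sorted directly; there is no keys() list to build first.
my @sorted = sort @uniq;

print "Unique: @uniq\n";
print "Sorted: @sorted\n";
```

The scan costs time as the list grows, so this is the usual trade of time for space and fits best where the unique list stays fairly short.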
Example: Taxonomy Trees
● The NCBI Taxonomy is delivered with each entry having a full tree.
● These must be reduced to a single tree for data entry and validation.
● There are several ways to do this...
Worst Solution: Parent tree.
● Since the tree is often used from the bottom up, some people store it as a child:parent relationship:
$parentz{ $child_id } = $parent_id;
● Unfortunately, this allocates a full hash table for each 1:1 relationship between a child and parent.
Another Bad Solution: Child Tree
● Another alternative is storing the children in a hash for each parent:
$childz{ $parent_id }{ $child_id } = ();
$childz{ '' } = [ $root_id ];
● This works via depth-first search to generate the trees and has space to store the tree depth.
● Hashes are bulky and slow for storing a single-level structure like this.
Another Solution: Single-Level Hash
● One oft-forgotten bit of Perly lore in the age of references: multipart hash keys.
$childz{ $parent_id, $child_id } = $depth;
$childz{ "" } = [ $root_id ];
● Trades wasted space in thousands of anon hashes for split /$;/o, $key and grep's.
● Usable for moderate trees.
● Obviously painful for really large trees.
Q: Why Nest Hashes?
● Hashes are nice for the top-level lookup, but why nest them?
● Arrays save about 85% of the overhead below the top level.
● Any wasted space from the arrays growing via push is more than saved by avoiding hashes.
● The arrays only need to be sorted once if the tree is used multiple times.
my $c = $childz{ $parent_id } ||= [];
$new_id ~~ $c or push @$c, $new_id;
Nested Lists
● List::Util has first() which saves grepping entire lists.
● A key and payload on an array can be handled quickly.
first { $_->[0] eq $key } @data;
● For shorter lists this saves space and can be faster than a hash.
● This is best for numerics, which don't have to be converted to text in order to be hashed: $_->[0] == $value is the least amount of work to compare integers.
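A small sketch of the key-and-payload lookup with integer keys. The @data contents and $value are invented for illustration; first() is the real List::Util function.

```perl
use strict;
use warnings;
use List::Util qw( first );

# Hypothetical [ key, payload ] pairs with integer keys.
my @data =
(
    [ 15    => 'short run'  ],
    [ 240   => 'medium run' ],
    [ 9_800 => 'long run'   ],
);

my $value = 240;

# first() stops at the first match; grep would scan the whole list.
my $found = first { $_->[0] == $value } @data;

print $found
    ? "$value: $found->[1]\n"
    : "$value: not found\n";
```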
Manage Lifespans
● Lexical variables are an obvious place.
● Local values are another.
● Saves reallocating a set of values within tight loops in the called code.
● Local hash keys are a good way to manage storage in reused hashes handled with recursion.
● Use delete to remove hash keys in multilevel structures instead of assigning an empty list or "".
● This preserves the skeleton for recycling.
● Saves storing the keys.
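One way local hash keys and delete might look in practice. The %tree, %path, and %cache structures and the walk() routine are hypothetical; the point is that local cleans up each key when the recursive call returns.

```perl
use strict;
use warnings;

# Hypothetical tree: parent id => list of child ids.
my %tree = ( 1 => [ 2, 3 ], 2 => [ 4 ], 3 => [], 4 => [] );

# Reused package hash holding only the ids on the current path.
our %path  = ();
my  @paths = ();

sub walk
{
    my $id = shift;

    # The key lasts until this call returns, then perl removes it.
    local $path{ $id } = 1;

    my @kids = @{ $tree{ $id } };

    # At a leaf, record the path from the root.
    @kids
    or push @paths, join ' ', sort { $a <=> $b } keys %path;

    walk( $_ ) for @kids;

    return;
}

walk( 1 );

print "Path: $_\n" for @paths;
print "Keys left in path hash: ", scalar( keys %path ), "\n";

# delete removes one key; the nested hash stays in place for reuse.
my %cache = ( run => { a => 1, b => 2 } );
delete $cache{ run }{ a };

print "Keys left in run: ", join( ' ', keys %{ $cache{ run } } ), "\n";
```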
Use Simpler Objects
● If you're using inside-out objects, why bless a hash?
● Users aren't supposed to diddle around inside your objects anyway.
● The only thing you care about is the address.
● Bless something smaller:
my $obj = bless \(my $a), $package;
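A minimal inside-out class built on a blessed scalar reference, as a sketch only. The Counter package, its methods, and the %count store are hypothetical; refaddr() from Scalar::Util supplies the address used as the key.

```perl
package Counter;    # hypothetical class name

use strict;
use warnings;
use Scalar::Util qw( refaddr );

# Inside-out storage: the attribute lives here, keyed by object address.
my %count = ();

sub new
{
    my $class = shift;

    # Bless a reference to an anonymous scalar instead of a hash.
    my $obj = bless \( my $anon ), $class;

    $count{ refaddr( $obj ) } = 0;

    return $obj;
}

sub increment { return ++$count{ refaddr( $_[0] ) } }
sub count     { return   $count{ refaddr( $_[0] ) } }

# Remove the attribute when the object goes away.
sub DESTROY   { delete $count{ refaddr( $_[0] ) }; return }

package main;

my $obj = Counter->new;

$obj->increment for 1 .. 3;

print "Count: ", $obj->count, "\n";
```

The object itself carries no data, so callers cannot reach the attributes through it; the cost is the DESTROY bookkeeping.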
Use Linked Lists for Queues
● Automatically frees discarded nodes without having to modify the entire list.
● Based on an array, they don't use much extra data:
$node = [ $ref_to_next, @node_data ];
● Walking the list is simple enough:
( $node, my @data ) = @$node;
● So is removing a node:
$node->[0] = $node->[0][0];
● These are quite convenient for threading.
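The three operations above can be put together as a small queue. This is a sketch under assumptions: the dummy $head node, the $tail pointer, and the job names are illustrative choices, not part of the original design.

```perl
use strict;
use warnings;

# Each node is [ $ref_to_next, @node_data ]; $head is a dummy first node.
my $head = [ undef ];
my $tail = $head;

# Append hypothetical jobs to the end of the queue.
for my $job ( qw( fetch parse store ) )
{
    $tail = $tail->[0] = [ undef, $job ];
}

# Remove the node after $head ('fetch'): nothing else in the list moves,
# and perl frees the node once nothing refers to it.
$head->[0] = $head->[0][0];

# Walk what is left.
my @seen = ();
my $node = $head->[0];

while( $node )
{
    ( $node, my @data ) = @$node;
    push @seen, @data;
}

print "Remaining: @seen\n";
```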
Use Hashes for Sparse Arrays
● OK, time to stop beating up on hashes.
● They beat out arrays for sparse lists.
● Even lists of integers.
● Say a collection of DNA runs from 15 to 10_000 bases, filling about 10% of the actual values.
● You could store it as:
$dnaz[ $length ] = [ qw( dna dna dna ) ];
● But this is probably better stored in a hash:
$dnaz{ $length } = [ qw( dna dna dna ) ];
Accessing Hash Keys: Integer Slices
● Numeric sequences work fine as hash keys.
● Say you want to find all of the sequences within +/-10% of the current length:
my $min = 0.9 * $length;
my $max = 1.1 * $length;
my @found = grep { $_ } @dnaz{ ( $min .. $max ) };
● For nontrivial, sparse lists this saves scaffolding by only storing the structure necessary.
● This doesn't change the data storage, just the overhead for accessing it by length.
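A runnable sketch of the range lookup on a sparse hash. The %dnaz contents are invented, and using exists on each candidate length (rather than slicing the hash) is a variation chosen here so the test never touches keys that are absent; ceil() and floor() from POSIX make the integer bounds explicit.

```perl
use strict;
use warnings;
use POSIX qw( ceil floor );

# Hypothetical sparse data: sequence length => list of sequences.
my %dnaz =
(
    15   => [ qw( seq_x ) ],
    95   => [ qw( seq_a seq_b ) ],
    104  => [ qw( seq_c ) ],
    9000 => [ qw( seq_d ) ],
);

my $length = 100;

# The range operator works on integers, so round the bounds inward.
my $min = ceil ( 0.9 * $length );
my $max = floor( 1.1 * $length );

# Only lengths that are actually stored survive the grep.
my @lengths = grep { exists $dnaz{ $_ } } $min .. $max;
my @found   = map  { @{ $dnaz{ $_ } } } @lengths;

print "Lengths: @lengths\n";
print "Found:   @found\n";
```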
Store Upper-Triangular Comparisons
● Saves more than half the space.
● Accessor can look for $i > $j ? [$i][$j] : [$j][$i] and get the same results.
● Requires designing symmetric comparison algorithms (values can be returned as-is or just negated).
● Also saves about half the processing time to only generate a single comparison for each pair.
● Requires access to the algorithm.
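A sketch of triangular storage with a swapping accessor. The comparison used here (absolute difference in length) is a stand-in chosen only because it is symmetric; @seqz, @compare, and compared() are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical sequences; the comparison below is a symmetric stand-in.
my @seqz = qw( acgt ac acgtacgt );

# Only cells with $i >= $j are generated and stored.
my @compare = ();

for my $i ( 0 .. $#seqz )
{
    for my $j ( 0 .. $i )
    {
        $compare[ $i ][ $j ]
        = abs( length( $seqz[ $i ] ) - length( $seqz[ $j ] ) );
    }
}

# The accessor swaps the indices to reach the stored half.
sub compared
{
    my ( $i, $j ) = @_;

    return $i > $j
        ? $compare[ $i ][ $j ]
        : $compare[ $j ][ $i ];
}

print "0 vs 2: ", compared( 0, 2 ), "\n";
print "2 vs 0: ", compared( 2, 0 ), "\n";
```

For three sequences this stores six cells instead of nine, and each pair is computed once.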
Example: DNA Analysis
● Our W-curve analysis is used to compare large groups of DNA to one another.
● The original algorithm compared the curves until the first one was exhausted.
● Changing that to use the longer sequence in all cases saved us over half the comparison time.
Summary
● Devel::Size can be useful in your code.
● Managing the lifespan of values helps.
● Using efficient structures helps even more.
● Use arrays instead of hash structures where they make sense.
● Bless smaller structures: scalars, regexen, globs make perfectly good objects and take less space than hashes.
● Use XS or Inline where necessary.
● And, yes, size() still matters.