Optimizing XML Compression
Gregory Leighton, Denilson Barbosa
University of Alberta
August 24, 2009
Motivation
• XML's verbose nature (repeated subtrees, lengthy tag names, …) can greatly inflate document size
• Compression is an obvious solution
– "Generic" algorithms (gzip, bzip2) often yield good compression rates for XML, yet hinder query processing: because they do not preserve the XML structure of the original file, the entire document must be decompressed before individual nodes in the XML tree can be accessed
– "XML-conscious" compression schemes have been developed that are capable of preserving document structure
• For the latter schemes, choosing the "best" compression strategy is very difficult
– Compressing similar data values together can reduce storage costs, but tends to increase time costs for compression and decompression
• Our goal: to "optimize" the performance of XML-conscious compression schemes by suggesting a good trade-off between space savings and time costs
XML-Conscious Compression Schemes: Permutation-Based Approaches
• Document is rearranged to localize repetitions before being passed to back-end compressor(s)
– Data segments are grouped into different containers, typically based on the parent element's type
– Tag structure ("skeleton") and data segments are compressed separately
[Figure: an XML document is passed through a shredder, which separates it into a skeleton (sent to a structure compressor) and data containers (each sent to a data compressor).]
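The shredding step in the diagram can be sketched in a few lines of Python, using the standard xml.etree module. This is an illustration, not the paper's implementation; the function name and the path-based container keys are our own choices:

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

def shred(xml_text):
    """Split an XML document into a tag skeleton and data containers,
    keyed by the root-to-node path (one container per distinct path)."""
    containers = defaultdict(list)
    skeleton = []

    def walk(elem, path):
        path = path + "/" + elem.tag
        skeleton.append(elem.tag)
        # Attribute values go into a container named after the attribute path
        for name, value in elem.attrib.items():
            containers[path + "/@" + name].append(value)
        # Text content goes into a container named after the element path
        if elem.text and elem.text.strip():
            containers[path].append(elem.text.strip())
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return skeleton, dict(containers)

doc = "<users><user id='u-8026'><prestige>4.7</prestige></user></users>"
skeleton, containers = shred(doc)
# containers["/users/user/@id"] == ["u-8026"]
# containers["/users/user/prestige"] == ["4.7"]
```

The skeleton (the tag sequence) and the containers would then be handed to separate compressors, as in the diagram.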
[Figure: example XML tree rooted at users, with two user subtrees. Each user has an @id attribute ("u-8026", "u-9125") and a prestige value ("4.7", "3.9"), plus a favorites subtree containing movies (movie elements with title and year: "Smart People"/"2008", "Scary Movie 2"/"2001", "Smoke"/"1995") and music (a song with title "A New Career in a New Town" and artist "David Bowie").]
Path-Based Partitioning Strategy
A  /users/user/@id: "u-8026", "u-9125"
B  /users/user/prestige: "4.7", "3.9"
C  /users/user/favorites/movies/movie/title: "Smart People", "Scary Movie 2", "Smoke"
D  /users/user/favorites/movies/movie/year: "2008", "2001", "1995"
E  /users/user/favorites/movies/movie/rating: "4.7", "3.1", "4.9"
F  /users/user/favorites/music/song/title: "A New Career in a New Town"
G  /users/user/favorites/music/song/artist: "David Bowie"
H  /users/user/favorites/music/song/rating: "4.1"
Improving Permutation-Based XML Compression
• Selecting a different container partitioning strategy
– Consider additional factors besides the identity of the parent element (e.g., the data type of the content)
– Grouping together containers with high pairwise similarity allows them to share the same compression model, reducing storage costs
– Can improve random-access performance: group together values that frequently appear in the same queries
• Choosing "better" compression algorithms for container subsets
– Choose an algorithm best suited for the data type
– Certain algorithms support operations over compressed data (e.g., the Huffman algorithm supports equality comparisons)

Goal: find an optimal compression configuration, specifying a partitioning strategy and an algorithm assignment for each container subset, such that overall compression gain is maximized.
An Example Compression Configuration
[Figure: an example compression configuration. The containers are grouped into four subsets, each assigned a compression algorithm (Huffman coding or LZ77): {C, F, G} (titles and artist names), {A} (user ids), {D} (years), and {B, E, H} (ratings).]

• A container subset consisting of containers C, F, and G is formed based on the intuition that these containers hold "similar" data (English text). The contents of these three containers are concatenated and compressed using a single run of the LZ77 algorithm.
• Containers B, E, and H store real numbers, so it may be beneficial to group them within a single container subset.
Terminology:
• Container subset = a set of one or more containers
• Container grouping = a set of one or more container subsets
• Partitioning strategy = a container grouping that covers all containers
Evaluating Compression Configurations
• 3 relevant measures:
– Storage gain: measures the relative amount of space saved by applying a specific compression algorithm a to a container subset S, denoted gain(S, a)
• considers not just the compressed size of S, but also the space requirements of the data structures constructed by the compression source model (e.g., the Huffman tree)
– Compression and decompression time costs: measure how long it takes to apply/reverse the compression process of a on S, denoted comp(S, a) and decomp(S, a)
Discovering an Optimal Compression Configuration
Problem: Given an algorithm set A, a set of data containers C, and upper bounds T_c and T_d on compression/decompression time costs, find the compression configuration that maximizes total storage gain while observing the bounds T_c and T_d.

Theorem: Selecting an optimal compression configuration is NP-hard.

The "hardness" is due to the inherent difficulty of selecting an optimal partitioning strategy; algorithm assignment requires only polynomial time (w.r.t. |A| and |C|).
An Approximation Algorithm for Compression Configuration Selection
• Stage 1: Use a branch-and-bound procedure to discover a set of candidate partitioning strategies
• Stage 2: Test available compression algorithms against the set of candidate partitioning strategies, returning the resulting compression configuration that maximizes compression gain while observing specified bounds on compression/decompression time costs
Branch-and-Bound Procedure: Estimating Compressibility of Container Subsets
• Shannon's entropy rate indicates the best compression any lossless scheme can achieve
– impractical to compute, so we instead turn to an approximate measure of compressibility
• LZ76: Lempel and Ziv (1976) proposed a measure of finite-string complexity that asymptotically approaches the entropy rate
• Idea: parse the input string x from left to right, recursively building a set of phrases P_x. The complexity C_LZ(x) is given by the ratio of phrases per character:

C_LZ(x) = |P_x| / |x|
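The parsing idea above can be sketched as follows: a minimal dictionary-based implementation, assuming each phrase is the shortest prefix of the remaining input not yet in the dictionary (which reproduces the parse used in the worked example later in the talk):

```python
def lz76_parse(s):
    """Parse s left to right into phrases: at each step, take the
    shortest prefix of the remaining input not yet in the dictionary."""
    phrases, seen = [], set()
    i = 0
    while i < len(s):
        j = i + 1
        while j < len(s) and s[i:j] in seen:
            j += 1
        phrase = s[i:j]
        phrases.append(phrase)
        seen.add(phrase)
        i = j
    return phrases

def c_lz(s):
    """LZ76 complexity: phrases per character."""
    return len(lz76_parse(s)) / len(s)

# For "aaabc" the parse is ["a", "aa", "b", "c"], so C_LZ = 4/5 = 0.8
```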
Branch-and-Bound Procedure: Estimating Storage Costs for a Container Subset
The storage cost (in bits) associated with a container subset S is computed as

storageCost(S) = t · (8 + log2(t))

where t is the number of phrases in the LZ76 dictionary for S.
• The 8-bit term represents the cost of encoding the "innovative" character each time a new dictionary entry is created.
• Using a fixed-length binary encoding, each dictionary phrase requires log2(t) bits, or t · log2(t) bits for all t phrases.
• The storage cost for a container grouping is the sum of the storage costs of each container subset in that grouping.
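As a quick check, the formula translates directly to code (a sketch; the function name is ours):

```python
import math

def storage_cost(t):
    """Storage cost in bits for an LZ76 dictionary of t phrases:
    8 bits for each phrase's innovative character, plus a
    fixed-length log2(t)-bit code per phrase."""
    return t * (8 + math.log2(t))

# A 4-phrase dictionary costs 4 * (8 + 2) = 40 bits
```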
Branch-and-Bound Procedure: Modeling Compression Gain
• Local compression gain (localGain): indicates the overall gain achieved by a grouping G; computed as the sum of the localGain values of each container subset S in G.
• The localGain for each subset S is computed as

localGain(S) = max{ 0, 8 · |S| − (C_LZ(S) · |S| + storageCost(S)) }

where |S| denotes the total byte length of the contents of S. Here 8 · |S| is the uncompressed size of S (in bits), C_LZ(S) · |S| + storageCost(S) is the estimated size of compressed S, and storageCost(S) accounts for the size of the LZ76 dictionary for S.
Branch-and-Bound Procedure: Modeling Compression Gain (cont'd)
• Maximum potential compression gain (mpGain): indicates an upper bound on the achievable localGain for any partitioning strategy that uses the current grouping G as a "starting point" (i.e., one that preserves the subset placement of all containers present in G).
• mpGain is maximized for a container subset when C_LZ(S) and storageCost(S) are minimized
– this happens when the longest applicable phrase is created at every step during the LZ76 parsing of S, by appending a new character to the end of the longest phrase currently in the dictionary
– the mpGain calculation consists of building successively longer phrases until all remaining characters in the original container set C have been processed
Example Gain Calculation
Assume there are two containers: C1 = {aaabc} and C2, containing an additional 5 characters.

The localGain of C1 is computed by performing an LZ76 parsing of C1:
a a a b c → Dictionary: <a> <aa> <b> <c>
C_LZ(C1) = 4 phrases / 5 characters = 0.8
storageCost(C1) = 4 · (8 + log2(4)) = 40 bits
localGain(C1) = max{0, 5 · 8 − (0.8 · 5 + 40)} = max{0, 40 − 44} = 0
→ C1 should be left uncompressed

To compute mpGain(C1), we continue these steps until the 5 characters from C2 have been processed (nUnprocessedChars = 5):
• Select the longest phrase P in the dictionary, having length L.
• If L < nUnprocessedChars, add to the dictionary a new phrase of length L + 1 by appending a new character to P, and subtract L + 1 from nUnprocessedChars. (Here: <aaa> is added, leaving nUnprocessedChars = 2.)
• Otherwise, reuse an existing phrase of length nUnprocessedChars and stop.

Updated C_LZ = 5 phrases / 10 characters = 0.5
Updated storageCost = 5 · (8 + log2(5)) ≈ 51.6096 bits
mpGain ≈ 10 · 8 − (0.5 · 10 + 51.6096) ≈ 23.3904 bits
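The arithmetic in this example can be checked with a short script. This is a sketch mirroring the slide's steps, starting from the parse of C1 given above rather than re-deriving it:

```python
import math

def storage_cost(num_phrases):
    return num_phrases * (8 + math.log2(num_phrases))

def local_gain(num_phrases, num_chars):
    c_lz = num_phrases / num_chars
    uncompressed_bits = 8 * num_chars
    compressed_bits = c_lz * num_chars + storage_cost(num_phrases)
    return max(0.0, uncompressed_bits - compressed_bits)

def mp_gain(num_phrases, num_chars, longest, n_unprocessed):
    """Optimistic gain: keep extending the longest dictionary phrase
    by one character until the remaining input is exhausted."""
    total_chars = num_chars + n_unprocessed
    while n_unprocessed > 0:
        if longest < n_unprocessed:
            longest += 1                 # new phrase of length L + 1
            num_phrases += 1
            n_unprocessed -= longest
        else:
            n_unprocessed = 0            # reuse an existing phrase and stop
    c_lz = num_phrases / total_chars
    compressed_bits = c_lz * total_chars + storage_cost(num_phrases)
    return max(0.0, 8 * total_chars - compressed_bits)

# C1 = "aaabc": 4 phrases over 5 characters, longest phrase <aa> has length 2
print(local_gain(4, 5))               # 0.0 -> leave C1 uncompressed
print(round(mp_gain(4, 5, 2, 5), 4))  # 23.3904
```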
Branch-and-Bound Procedure: Search Tree
• Each tree node corresponds to a container grouping
• Each node stores the local compression gain (localGain) and maximum potential compression gain (mpGain) values for its container grouping
• Nodes at depth i in the tree represent all possibilities for assigning container i to a container grouping
• Crucial to avoid enumerating the entire search tree…

Example tree for two containers: the root {C1} has children {C1},{C2} and {C1,C2}.

Bounding criterion: kill each subtree rooted at a node n for which

mpGain(n) < optGain − δ

where optGain is the best local gain witnessed so far and δ is a threshold value supplied as input.

The remaining nodes at the bottom level of the search tree represent the set of candidate partitioning strategies.
Example: C = {C1, C2, C3}, where C1 = {aaabcaaabcaaabcabcab}, C2 = {15720653197608243849}, C3 = {abcababcbaaaabcabcab}, δ = 15.0 bits

Depth 1 (root): {C1} — localGain = 50.4707, mpGain = 300.6970

Depth 2:
• {C1},{C2} — localGain = 50.4707, mpGain = 168.9804 (highest mpGain, so we explore it first)
• {C1,C2} — localGain = 0, mpGain = 108.6180 (mpGain is less than 126.3966 − 15.0 = 111.3966, so this subtree is killed without being explored)

Depth 3 (children of {C1},{C2}):
• {C1},{C2},{C3} — localGain = mpGain = 100.9413
• {C1,C3},{C2} — localGain = mpGain = 126.3966
• {C1},{C2,C3} — localGain = mpGain = 50.4707

The mpGains of the 1st and 3rd children are less than 126.3966 − 15.0 = 111.3966, so both are killed.

{C1,C3},{C2} is the only candidate partitioning strategy.
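The search itself can be sketched as a depth-first enumeration of container groupings with pruning. This is a simplified illustration: the gain function is a caller-supplied stand-in, and the optimistic bound below (all remaining characters count as pure gain) is ours, not the LZ76-based mpGain from the slides:

```python
def best_grouping(containers, gain, delta=0.0):
    """Branch-and-bound over partitions of `containers`.
    `gain(subset)` scores one container subset; a grouping's score is
    the sum over its subsets. Subtrees whose optimistic bound falls
    below best_so_far - delta are pruned."""
    best = {"gain": float("-inf"), "grouping": None}

    def bound(grouping, remaining):
        # Optimistic: every remaining character could become pure gain.
        current = sum(gain(s) for s in grouping)
        return current + sum(len(c) for c in remaining)

    def expand(i, grouping):
        if i == len(containers):
            total = sum(gain(s) for s in grouping)
            if total > best["gain"]:
                best["gain"] = total
                best["grouping"] = [list(s) for s in grouping]
            return
        if bound(grouping, containers[i:]) < best["gain"] - delta:
            return                        # kill this subtree
        c = containers[i]
        for subset in grouping:           # place c in an existing subset
            subset.append(c)
            expand(i + 1, grouping)
            subset.pop()
        grouping.append([c])              # or start a new subset
        expand(i + 1, grouping)
        grouping.pop()

    expand(0, [])
    return best["gain"], best["grouping"]

# Toy gain for illustration: characters saved by sharing an alphabet.
def toy_gain(subset):
    text = "".join(subset)
    return len(text) - len(set(text))
```

With `toy_gain`, calling `best_grouping(["aa", "ab", "cc"], toy_gain)` explores the five partitions of three containers (minus pruned subtrees) and returns the best total gain.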
Determining an Optimal Compression Configuration
For each candidate partitioning strategy G returned by the branch-and-bound procedure:
1. For each container subset S in G, assign the compression algorithm that achieves the best compression gain while obeying the specified time bounds on compression/decompression.
2. Compute the overall compression gain for G as the sum of the gains for each S in G.

Choose the G with the highest overall compression gain, together with the corresponding algorithm assignment, as the compression configuration.
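This second stage can be sketched as follows. The measurements here are mocked for illustration; in practice gain, comp, and decomp would come from trial runs of each algorithm on each subset:

```python
def choose_configuration(candidates, measure, t_c, t_d):
    """For each candidate partitioning strategy (a list of container
    subsets), assign to every subset the algorithm with the best gain
    whose compression/decompression times stay within t_c and t_d,
    then return the strategy with the highest total gain.
    `measure(subset, algo)` returns (gain, comp_time, decomp_time)."""
    algorithms = ["huffman", "lz77"]   # available back-end compressors (illustrative)
    best = None
    for grouping in candidates:
        assignment, total = {}, 0.0
        for subset in grouping:
            feasible = [
                (measure(subset, a)[0], a) for a in algorithms
                if measure(subset, a)[1] <= t_c and measure(subset, a)[2] <= t_d
            ]
            gain, algo = max(feasible, default=(0.0, None))  # None = uncompressed
            assignment[tuple(subset)] = algo
            total += gain
        if best is None or total > best[0]:
            best = (total, grouping, assignment)
    return best

# Mock measurements (gain, comp_time, decomp_time) for illustration:
def mock_measure(subset, algo):
    return (20, 50, 5) if algo == "lz77" else (10, 5, 5)

total, grouping, assignment = choose_configuration([[["A"]]], mock_measure, t_c=30, t_d=10)
# LZ77 would gain more here, but exceeds t_c, so Huffman coding is chosen.
```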
Early Results

Dataset      Original    # Containers /      Default Config /  Suggested Config /  Explored  Total
             Size (B)    Search Tree Height  XMill + zlib (B)  XMill + zlib (B)    Nodes     Nodes
Baseball        671,924  41                     35,554            35,001           2,134     2.35 x 10^36
DBLP             97,190  28                     17,277            15,217           1,383     6.16 x 10^21
Weblog        2,648,181  10                     74,450            74,421             199     115,975
Lineitem     32,295,475  18                  1,513,793         1,592,386              74     6.82 x 10^11
Shakespeare   7,894,983  39                  1,990,855         1,986,160           2,115     7.46 x 10^32

δ = 120 bits
Additional bounding criterion used: only the nodes having the top 60 mpGain scores are explored at each tree level
Test system: Intel Core 2 Duo 3.16 GHz, 4 GB RAM, Ubuntu 9.04 Desktop

Branch-and-Bound Time (m:s):
Baseball      0:10.441
DBLP          0:07.116
Weblog        0:10.893
Lineitem      3:54.483
Shakespeare  18:47.718
Future Work
• Conduct more experiments involving the proposed approximation algorithm in concert with various permutation-based XML compressors (e.g., AXECHOP, XMill)
• Seek improvements to the branch-and-bound procedure
– Starting off with a "sprint" phase that greedily searches for the best localGain would allow large portions of the search tree to be killed at an earlier stage
– Adding a parameter that explores only the top k nodes per tree level can greatly reduce memory and time costs
– Improve the efficiency of the existing implementation!
• Investigate alternative approximation algorithms for selecting a partitioning strategy
Final Slide
• Thank you
• Questions?