Optimizing XML Compression
Gregory Leighton, Denilson Barbosa
University of Alberta
August 24, 2009
Motivation
• XML's verbose nature (repeated subtrees, lengthy tag names, …) can greatly inflate document size
• Compression is an obvious solution
– "Generic" algorithms (gzip, bzip2) often yield good compression rates for XML, yet hinder query processing: because they do not preserve the XML structure of the original file, the entire document must be decompressed before individual nodes in the XML tree can be accessed
– "XML-conscious" compression schemes have been developed that are capable of preserving document structure
• For the latter schemes, choosing the "best" compression strategy is very difficult
– Compressing similar data values together can reduce storage costs, but tends to increase time costs for compression and decompression
• Our goal: to "optimize" the performance of XML-conscious compression schemes by suggesting a good trade-off between space savings and time costs
XML-Conscious Compression Schemes: Permutation-Based Approaches
• Document is rearranged to localize repetitions before being passed to back-end compressor(s)
– Data segments are grouped into different containers, typically based on the parent element's type
– Tag structure ("skeleton") and data segments are compressed separately
[Figure: an XML document is passed through a shredder, which separates it into a skeleton (sent to a structure compressor) and data containers (each sent to a data compressor).]
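The shredding step in the diagram can be sketched in a few lines of Python, using the standard xml.etree module. This is an illustration, not the paper's implementation; the function name and the path-based container keys are our own choices:

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

def shred(xml_text):
    """Split an XML document into a tag skeleton and data containers,
    keyed by the root-to-node path (one container per distinct path)."""
    containers = defaultdict(list)
    skeleton = []

    def walk(elem, path):
        path = path + "/" + elem.tag
        skeleton.append(elem.tag)
        # Attribute values go into a container named after the attribute path
        for name, value in elem.attrib.items():
            containers[path + "/@" + name].append(value)
        # Text content goes into a container named after the element path
        if elem.text and elem.text.strip():
            containers[path].append(elem.text.strip())
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return skeleton, dict(containers)

doc = "<users><user id='u-8026'><prestige>4.7</prestige></user></users>"
skeleton, containers = shred(doc)
# containers["/users/user/@id"] == ["u-8026"]
# containers["/users/user/prestige"] == ["4.7"]
```

The skeleton (the tag sequence) and the containers would then be handed to separate compressors, as in the diagram.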
[Figure: example XML tree rooted at users, with two user subtrees. Each user has an @id attribute ("u-8026", "u-9125") and a prestige value ("4.7", "3.9"), plus a favorites subtree containing movies (movie elements with title and year: "Smart People"/"2008", "Scary Movie 2"/"2001", "Smoke"/"1995") and music (a song with title "A New Career in a New Town" and artist "David Bowie").]
Path-Based Partitioning Strategy
A  /users/user/@id: "u-8026", "u-9125"
B  /users/user/prestige: "4.7", "3.9"
C  /users/user/favorites/movies/movie/title: "Smart People", "Scary Movie 2", "Smoke"
D  /users/user/favorites/movies/movie/year: "2008", "2001", "1995"
E  /users/user/favorites/movies/movie/rating: "4.7", "3.1", "4.9"
F  /users/user/favorites/music/song/title: "A New Career in a New Town"
G  /users/user/favorites/music/song/artist: "David Bowie"
H  /users/user/favorites/music/song/rating: "4.1"
Improving Permutation-Based XML Compression
• Selecting a different container partitioning strategy
– Consider additional factors besides the identity of the parent element (e.g., the data type of the content)
– Grouping together containers with high pairwise similarity allows them to share the same compression model, reducing storage costs
– Can improve random-access performance: group together values that frequently appear in the same queries
• Choosing "better" compression algorithms for container subsets
– Choose an algorithm best suited for the data type
– Certain algorithms support operations over compressed data (e.g., the Huffman algorithm supports equality comparisons)

Goal: find an optimal compression configuration, specifying a partitioning strategy and an algorithm assignment for each container subset, such that overall compression gain is maximized.
An Example Compression Configuration
[Figure: an example compression configuration. The containers are grouped into four subsets, each assigned a compression algorithm (Huffman coding or LZ77): {C, F, G} (titles and artist names), {A} (user ids), {D} (years), and {B, E, H} (ratings).]

• A container subset consisting of containers C, F, and G is formed based on the intuition that these containers hold "similar" data (English text). The contents of these three containers are concatenated and compressed using a single run of the LZ77 algorithm.
• Containers B, E, and H store real numbers, so it may be beneficial to group them within a single container subset.
Terminology:
• Container subset = a set of one or more containers
• Container grouping = a set of one or more container subsets
• Partitioning strategy = a container grouping that covers all containers
Evaluating Compression Configurations
• 3 relevant measures:
– Storage gain: measures the relative amount of space saved by applying a specific compression algorithm a to a container subset S, denoted gain(S, a)
• considers not just the compressed size of S, but also the space requirements of the data structures constructed by the compression source model (e.g., the Huffman tree)
– Compression and decompression time costs: measure how long it takes to apply/reverse the compression process of a on S, denoted comp(S, a) and decomp(S, a)
Discovering an Optimal Compression Configuration
Problem: Given an algorithm set A, a set of data containers C, and upper bounds T_c and T_d on compression/decompression time costs, find the compression configuration that maximizes total storage gain while observing the bounds T_c and T_d.

Theorem: Selecting an optimal compression configuration is NP-hard.

The "hardness" is due to the inherent difficulty of selecting an optimal partitioning strategy; algorithm assignment requires only polynomial time (w.r.t. |A| and |C|).
An Approximation Algorithm for Compression Configuration Selection
• Stage 1: Use a branch-and-bound procedure to discover a set of candidate partitioning strategies
• Stage 2: Test available compression algorithms against the set of candidate partitioning strategies, returning the resulting compression configuration that maximizes compression gain while observing specified bounds on compression/decompression time costs
Branch-and-Bound Procedure: Estimating Compressibility of Container Subsets
• Shannon's entropy rate indicates the best compression any lossless scheme can achieve
– impractical to compute, so we instead turn to an approximate measure of compressibility
• LZ76: Lempel and Ziv (1976) proposed a measure of finite-string complexity that asymptotically approaches the entropy rate
• Idea: parse the input string x from left to right, recursively building a set of phrases P_x. The complexity C_LZ(x) is given by the ratio of phrases per character:

C_LZ(x) = |P_x| / |x|
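The parsing idea above can be sketched as follows: a minimal dictionary-based implementation, assuming each phrase is the shortest prefix of the remaining input not yet in the dictionary (which reproduces the parse used in the worked example later in the talk):

```python
def lz76_parse(s):
    """Parse s left to right into phrases: at each step, take the
    shortest prefix of the remaining input not yet in the dictionary."""
    phrases, seen = [], set()
    i = 0
    while i < len(s):
        j = i + 1
        while j < len(s) and s[i:j] in seen:
            j += 1
        phrase = s[i:j]
        phrases.append(phrase)
        seen.add(phrase)
        i = j
    return phrases

def c_lz(s):
    """LZ76 complexity: phrases per character."""
    return len(lz76_parse(s)) / len(s)

# For "aaabc" the parse is ["a", "aa", "b", "c"], so C_LZ = 4/5 = 0.8
```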
Branch-and-Bound Procedure: Estimating Storage Costs for a Container Subset
The storage cost (in bits) associated with a container subset S is computed as

storageCost(S) = t · (8 + log2(t))

where t is the number of phrases in the LZ76 dictionary for S.
• The 8-bit term represents the cost of encoding the "innovative" character each time a new dictionary entry is created.
• Using a fixed-length binary encoding, each dictionary phrase requires log2(t) bits, or t · log2(t) bits for all t phrases.
• The storage cost for a container grouping is the sum of the storage costs of each container subset in that grouping.
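As a quick check, the formula translates directly to code (a sketch; the function name is ours):

```python
import math

def storage_cost(t):
    """Storage cost in bits for an LZ76 dictionary of t phrases:
    8 bits for each phrase's innovative character, plus a
    fixed-length log2(t)-bit code per phrase."""
    return t * (8 + math.log2(t))

# A 4-phrase dictionary costs 4 * (8 + 2) = 40 bits
```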
Branch-and-Bound Procedure: Modeling Compression Gain
• Local compression gain (localGain): indicates the overall gain achieved by a grouping G; computed as the sum of the localGain values of each container subset S in G.
• The localGain for each subset S is computed as

localGain(S) = max{ 0, 8 · |S| − (C_LZ(S) · |S| + storageCost(S)) }

where |S| denotes the total byte length of the contents of S. Here 8 · |S| is the uncompressed size of S (in bits), C_LZ(S) · |S| + storageCost(S) is the estimated size of compressed S, and storageCost(S) accounts for the size of the LZ76 dictionary for S.
Branch-and-Bound Procedure: Modeling Compression Gain (cont'd)
• Maximum potential compression gain (mpGain): indicates an upper bound on the achievable localGain for any partitioning strategy that uses the current grouping G as a "starting point" (i.e., one that preserves the subset placement of all containers present in G).
• mpGain is maximized for a container subset when C_LZ(S) and storageCost(S) are minimized
– this happens when the longest applicable phrase is created at every step during the LZ76 parsing of S, by appending a new character to the end of the longest phrase currently in the dictionary
– the mpGain calculation consists of building successively longer phrases until all remaining characters in the original container set C have been processed
Example Gain Calculation
Assume there are two containers: C1 = {aaabc} and C2, containing an additional 5 characters.

The localGain of C1 is computed by performing an LZ76 parsing of C1:
a a a b c → Dictionary: <a> <aa> <b> <c>
C_LZ(C1) = 4 phrases / 5 characters = 0.8
storageCost(C1) = 4 · (8 + log2(4)) = 40 bits
localGain(C1) = max{0, 5 · 8 − (0.8 · 5 + 40)} = max{0, 40 − 44} = 0
→ C1 should be left uncompressed

To compute mpGain(C1), we continue these steps until the 5 characters from C2 have been processed (nUnprocessedChars = 5):
• Select the longest phrase P in the dictionary, having length L.
• If L < nUnprocessedChars, add to the dictionary a new phrase of length L + 1 by appending a new character to P, and subtract L + 1 from nUnprocessedChars. (Here: <aaa> is added, leaving nUnprocessedChars = 2.)
• Otherwise, reuse an existing phrase of length nUnprocessedChars and stop.

Updated C_LZ = 5 phrases / 10 characters = 0.5
Updated storageCost = 5 · (8 + log2(5)) ≈ 51.6096 bits
mpGain ≈ 10 · 8 − (0.5 · 10 + 51.6096) ≈ 23.3904 bits
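The arithmetic in this example can be checked with a short script. This is a sketch mirroring the slide's steps, starting from the parse of C1 given above rather than re-deriving it:

```python
import math

def storage_cost(num_phrases):
    return num_phrases * (8 + math.log2(num_phrases))

def local_gain(num_phrases, num_chars):
    c_lz = num_phrases / num_chars
    uncompressed_bits = 8 * num_chars
    compressed_bits = c_lz * num_chars + storage_cost(num_phrases)
    return max(0.0, uncompressed_bits - compressed_bits)

def mp_gain(num_phrases, num_chars, longest, n_unprocessed):
    """Optimistic gain: keep extending the longest dictionary phrase
    by one character until the remaining input is exhausted."""
    total_chars = num_chars + n_unprocessed
    while n_unprocessed > 0:
        if longest < n_unprocessed:
            longest += 1                 # new phrase of length L + 1
            num_phrases += 1
            n_unprocessed -= longest
        else:
            n_unprocessed = 0            # reuse an existing phrase and stop
    c_lz = num_phrases / total_chars
    compressed_bits = c_lz * total_chars + storage_cost(num_phrases)
    return max(0.0, 8 * total_chars - compressed_bits)

# C1 = "aaabc": 4 phrases over 5 characters, longest phrase <aa> has length 2
print(local_gain(4, 5))               # 0.0 -> leave C1 uncompressed
print(round(mp_gain(4, 5, 2, 5), 4))  # 23.3904
```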
Branch-and-Bound Procedure: Search Tree
• Each tree node corresponds to a container grouping
• Each node stores the local compression gain (localGain) and maximum potential compression gain (mpGain) values for its container grouping
• Nodes at depth i in the tree represent all possibilities for assigning container i to a container grouping
• Crucial to avoid enumerating the entire search tree…

Example tree for two containers: the root {C1} has children {C1},{C2} and {C1,C2}.

Bounding criterion: kill each subtree rooted at a node n for which

mpGain(n) < optGain − δ

where optGain is the best local gain witnessed so far and δ is a threshold value supplied as input.

The remaining nodes at the bottom level of the search tree represent the set of candidate partitioning strategies.
Example: C = {C1, C2, C3}, where C1 = {aaabcaaabcaaabcabcab}, C2 = {15720653197608243849}, C3 = {abcababcbaaaabcabcab}, δ = 15.0 bits

Depth 1 (root): {C1} — localGain = 50.4707, mpGain = 300.6970

Depth 2:
• {C1},{C2} — localGain = 50.4707, mpGain = 168.9804 (highest mpGain, so we explore it first)
• {C1,C2} — localGain = 0, mpGain = 108.6180 (mpGain is less than 126.3966 − 15.0 = 111.3966, so this subtree is killed without being explored)

Depth 3 (children of {C1},{C2}):
• {C1},{C2},{C3} — localGain = mpGain = 100.9413
• {C1,C3},{C2} — localGain = mpGain = 126.3966
• {C1},{C2,C3} — localGain = mpGain = 50.4707

The mpGains of the 1st and 3rd children are less than 126.3966 − 15.0 = 111.3966, so both are killed.

{C1,C3},{C2} is the only candidate partitioning strategy.
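The search itself can be sketched as a depth-first enumeration of container groupings with pruning. This is a simplified illustration: the gain function is a caller-supplied stand-in, and the optimistic bound below (all remaining characters count as pure gain) is ours, not the LZ76-based mpGain from the slides:

```python
def best_grouping(containers, gain, delta=0.0):
    """Branch-and-bound over partitions of `containers`.
    `gain(subset)` scores one container subset; a grouping's score is
    the sum over its subsets. Subtrees whose optimistic bound falls
    below best_so_far - delta are pruned."""
    best = {"gain": float("-inf"), "grouping": None}

    def bound(grouping, remaining):
        # Optimistic: every remaining character could become pure gain.
        current = sum(gain(s) for s in grouping)
        return current + sum(len(c) for c in remaining)

    def expand(i, grouping):
        if i == len(containers):
            total = sum(gain(s) for s in grouping)
            if total > best["gain"]:
                best["gain"] = total
                best["grouping"] = [list(s) for s in grouping]
            return
        if bound(grouping, containers[i:]) < best["gain"] - delta:
            return                        # kill this subtree
        c = containers[i]
        for subset in grouping:           # place c in an existing subset
            subset.append(c)
            expand(i + 1, grouping)
            subset.pop()
        grouping.append([c])              # or start a new subset
        expand(i + 1, grouping)
        grouping.pop()

    expand(0, [])
    return best["gain"], best["grouping"]

# Toy gain for illustration: characters saved by sharing an alphabet.
def toy_gain(subset):
    text = "".join(subset)
    return len(text) - len(set(text))
```

With `toy_gain`, calling `best_grouping(["aa", "ab", "cc"], toy_gain)` explores the five partitions of three containers (minus pruned subtrees) and returns the best total gain.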
Determining an Optimal Compression Configuration
For each candidate partitioning strategy G returned by the branch-and-bound procedure:
1. For each container subset S in G, assign the compression algorithm that achieves the best compression gain while obeying the specified time bounds on compression/decompression.
2. Compute the overall compression gain for G as the sum of the gains for each S in G.

Choose the G with the highest overall compression gain, together with the corresponding algorithm assignment, as the compression configuration.
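This second stage can be sketched as follows. The measurements here are mocked for illustration; in practice gain, comp, and decomp would come from trial runs of each algorithm on each subset:

```python
def choose_configuration(candidates, measure, t_c, t_d):
    """For each candidate partitioning strategy (a list of container
    subsets), assign to every subset the algorithm with the best gain
    whose compression/decompression times stay within t_c and t_d,
    then return the strategy with the highest total gain.
    `measure(subset, algo)` returns (gain, comp_time, decomp_time)."""
    algorithms = ["huffman", "lz77"]   # available back-end compressors (illustrative)
    best = None
    for grouping in candidates:
        assignment, total = {}, 0.0
        for subset in grouping:
            feasible = [
                (measure(subset, a)[0], a) for a in algorithms
                if measure(subset, a)[1] <= t_c and measure(subset, a)[2] <= t_d
            ]
            gain, algo = max(feasible, default=(0.0, None))  # None = uncompressed
            assignment[tuple(subset)] = algo
            total += gain
        if best is None or total > best[0]:
            best = (total, grouping, assignment)
    return best

# Mock measurements (gain, comp_time, decomp_time) for illustration:
def mock_measure(subset, algo):
    return (20, 50, 5) if algo == "lz77" else (10, 5, 5)

total, grouping, assignment = choose_configuration([[["A"]]], mock_measure, t_c=30, t_d=10)
# LZ77 would gain more here, but exceeds t_c, so Huffman coding is chosen.
```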
Early Results

Dataset      Original    # Containers /      Default Config /  Suggested Config /  Explored  Total
             Size (B)    Search Tree Height  XMill + zlib (B)  XMill + zlib (B)    Nodes     Nodes
Baseball        671,924  41                     35,554            35,001           2,134     2.35 x 10^36
DBLP             97,190  28                     17,277            15,217           1,383     6.16 x 10^21
Weblog        2,648,181  10                     74,450            74,421             199     115,975
Lineitem     32,295,475  18                  1,513,793         1,592,386              74     6.82 x 10^11
Shakespeare   7,894,983  39                  1,990,855         1,986,160           2,115     7.46 x 10^32

δ = 120 bits
Additional bounding criterion used: only the nodes having the top 60 mpGain scores are explored at each tree level
Test system: Intel Core 2 Duo 3.16 GHz, 4 GB RAM, Ubuntu 9.04 Desktop

Branch-and-Bound Time (m:s):
Baseball      0:10.441
DBLP          0:07.116
Weblog        0:10.893
Lineitem      3:54.483
Shakespeare  18:47.718
Future Work
• Conduct more experiments involving the proposed approximation algorithm in concert with various permutation-based XML compressors (e.g., AXECHOP, XMill)
• Seek improvements to the branch-and-bound procedure
– Starting off with a "sprint" phase that greedily searches for the best localGain would allow large portions of the search tree to be killed at an earlier stage
– Adding a parameter that explores only the top k nodes per tree level can greatly reduce memory and time costs
– Improve the efficiency of the existing implementation!
• Investigate alternative approximation algorithms for selecting a partitioning strategy
Final Slide
• Thank you
• Questions?