Page 1: Optimizing XML Compression Gregory Leighton Denilson Barbosa University of Alberta August 24, 2009.

Optimizing XML Compression

Gregory Leighton, Denilson Barbosa

University of Alberta, August 24, 2009

Page 2:

Motivation

• XML’s verbose nature (repeated subtrees, lengthy tag names, …) can greatly inflate document size
• Compression is an obvious solution
  – “Generic” algorithms (gzip, bzip2) often yield good compression rates for XML, yet hinder query processing since they don’t preserve the XML structure of the original file: one must decompress the entire document before accessing individual nodes in the XML tree
  – “XML-conscious” compression schemes have been developed which are capable of preserving document structure
• For the latter schemes, choosing the “best” compression strategy is very difficult
  – Compressing similar data values together can reduce storage costs, but tends to increase time costs for compression and decompression
• Our goal: to “optimize” the performance of XML-conscious compression schemes by suggesting a good trade-off between space savings and time costs

August 24, 2009 Optimizing XML Compression 2

Page 3:

XML-Conscious Compression Schemes: Permutation-Based Approaches

• Document is rearranged to localize repetitions before passing to back-end compressor(s)
  – Data segments are grouped into different containers, typically based on the parent element’s type
  – Tag structure (“skeleton”) and data segments are compressed separately


[Figure: the XML document passes through a Shredder, which emits the skeleton (fed to a structure compressor) and the data containers (each fed to a data compressor).]
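The shredding step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes containers are keyed by the root-to-node path, as in the partitioning strategy shown on slide 5.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def shred(xml_text):
    """Split an XML document into a structural skeleton and
    path-keyed data containers (a minimal sketch)."""
    containers = defaultdict(list)   # path -> list of data values
    skeleton = []                    # tag structure with values removed

    def walk(elem, path):
        p = path + "/" + elem.tag
        skeleton.append("<" + elem.tag + ">")
        for name, value in elem.attrib.items():
            containers[p + "/@" + name].append(value)
        if elem.text and elem.text.strip():
            containers[p].append(elem.text.strip())
        for child in elem:
            walk(child, p)
        skeleton.append("</" + elem.tag + ">")

    walk(ET.fromstring(xml_text), "")
    return skeleton, dict(containers)

doc = "<users><user id='u-8026'><prestige>4.7</prestige></user></users>"
skel, conts = shred(doc)
# conts: {'/users/user/@id': ['u-8026'], '/users/user/prestige': ['4.7']}
```

The skeleton and each container can then be handed to separate back-end compressors, as in the figure.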

Page 4:

[Figure: example XML tree for a “users” document. Each user element carries an @id (“u-8026”, “u-9125”) and a prestige value (“4.7”, “3.9”), plus a favorites subtree. Favorites hold movies (movie elements with title/year pairs “Smart People”/“2008”, “Scary Movie 2”/“2001”, “Smoke”/“1995”) and music (a song with title “A New Career in a New Town” and artist “David Bowie”).]

Page 5:

Path-Based Partitioning Strategy

A  /users/user/@id: “u-8026”, “u-9125”
B  /users/user/prestige: “4.7”, “3.9”
C  /users/user/favorites/movies/movie/title: “Smart People”, “Scary Movie 2”, “Smoke”
D  /users/user/favorites/movies/movie/year: “2008”, “2001”, “1995”
E  /users/user/favorites/movies/movie/rating: “4.7”, “3.1”, “4.9”
F  /users/user/favorites/music/song/title: “A New Career in a New Town”
G  /users/user/favorites/music/song/artist: “David Bowie”
H  /users/user/favorites/music/song/rating: “4.1”


Page 6:

Improving Permutation-Based XML Compression

• Selecting a different container partitioning strategy
  – Consider additional factors besides the identity of the parent element (e.g., data type of content)
  – Grouping together containers with high pairwise similarity allows them to share the same compression model, reducing storage costs
  – Can improve random access performance: group together values frequently appearing in the same queries
• Choosing “better” compression algorithms for container subsets
  – Choose an algorithm best suited for the data type
  – Certain algorithms support operations over compressed data (e.g., the Huffman algorithm supports equality comparisons)

Goal: find an optimal compression configuration specifying a partitioning strategy and an algorithm assignment for each container subset such that overall compression gain is maximized.

Page 7:

An Example Compression Configuration

[Figure: the containers are grouped into subsets {C,F,G}, {A}, {D}, and {B,E,H}; each subset is compressed as a unit with either LZ77 or Huffman coding.]

• Form a container subset consisting of containers C, F, and G based on the intuition that these containers possess “similar” data (English text). The contents of these 3 containers are concatenated and compressed using a single run of the LZ77 algorithm.
• Containers B, E, and H store real numbers, so it may be beneficial to group them within a single container subset.

Terminology:
• Container subset = a set of one or more containers
• Container grouping = a set of one or more container subsets
• Partitioning strategy = a container grouping that covers all containers

Page 8:

Evaluating Compression Configurations

• 3 relevant measures:
  – Storage gain: measures the relative amount of space saved by applying a specific compression algorithm a to a container subset S, denoted gain(S,a)
    • considers not just the compressed size of S, but also the space requirements of the data structures constructed by the compression source model (e.g., Huffman tree)
  – Compression and decompression time costs: measure how long it takes to apply/reverse the compression process of a on S, denoted comp(S,a) and decomp(S,a)


Page 9:

Discovering an Optimal Compression Configuration

Problem: given an algorithm set A, a set of data containers C, and upper bounds Tc and Td on compression/decompression time costs, find the compression configuration that maximizes total storage gain while observing the bounds defined by Tc and Td.

Theorem: Selecting an optimal compression configuration is NP-hard.

“Hardness” is due to the inherent difficulty of selecting an optimal partitioning strategy; algorithm assignment requires polynomial time (w.r.t. |A| and |C|)


Page 10:

An Approximation Algorithm for Compression Configuration Selection

• Stage 1: Use a branch-and-bound procedure to discover a set of candidate partitioning strategies

• Stage 2: Test available compression algorithms against the set of candidate partitioning strategies, returning the resulting compression configuration that maximizes compression gain while observing specified bounds on compression/decompression time costs


Page 11:

Branch-and-Bound Procedure: Estimating Compressibility of Container Subsets

• Shannon’s entropy rate indicates the best compression any lossless scheme can achieve
  – impractical to compute, so we instead turn to an approximate measure of compressibility
• LZ76: Lempel and Ziv (1976) proposed a measure of finite string complexity that asymptotically approaches the entropy rate
• Idea: parse the input string x from left to right, recursively building a set of phrases Px. Complexity CLZ(x) is given by the ratio of phrases per character:

CLZ(x) = |Px| / |x|
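This parsing can be sketched in Python. The rule below is my reading of the procedure illustrated by the “aaabc” example on slide 15 (each phrase extends the longest phrase already in the dictionary by one character); it is not the authors' implementation.

```python
def lz76_phrases(x):
    """Parse x left to right; each new phrase extends the longest
    prefix already in the dictionary by one character."""
    seen = set()
    phrases = []
    i = 0
    while i < len(x):
        j = i + 1
        # grow the candidate while it is already a known phrase
        while j < len(x) and x[i:j] in seen:
            j += 1
        phrase = x[i:j]
        phrases.append(phrase)
        seen.add(phrase)
        i = j
    return phrases

def c_lz(x):
    """Approximate compressibility: phrases per character."""
    return len(lz76_phrases(x)) / len(x)

print(lz76_phrases("aaabc"))  # ['a', 'aa', 'b', 'c']
print(c_lz("aaabc"))          # 0.8
```

Lower values of c_lz indicate a more compressible (more repetitive) string.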


Page 12:

Branch-and-Bound Procedure: Estimating Storage Costs for a Container Subset

The storage cost (in bits) associated with a container subset S is computed as

storageCost(S) = t · (8 + log2(t))

where t is the number of phrases in the LZ76 dictionary for S.
• The 8-bit term represents the cost of encoding the “innovative” character each time a new dictionary entry is created.
• Using a fixed-length binary encoding, each dictionary phrase requires log2(t) bits, or t · log2(t) bits for all t phrases.

The storage cost for a container grouping is the sum of the storage costs of each container subset in that grouping.
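The formula is easy to evaluate directly; for instance, a 4-phrase dictionary costs 4 · (8 + 2) = 40 bits:

```python
import math

def storage_cost(t):
    """Bits needed for a t-phrase LZ76 dictionary: 8 bits for each
    innovative character plus a log2(t)-bit fixed-length reference
    per phrase."""
    return t * (8 + math.log2(t))

print(storage_cost(4))  # 40.0
```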

Page 13:

Branch-and-Bound Procedure: Modeling Compression Gain

Local compression gain (localGain): indicates the overall gain achieved by a grouping G; computed as the sum of the localGain values of each container subset S in G.

The localGain for each subset S is computed as

localGain(S) = max{0, 8 · |S| − (CLZ(S) · |S| + storageCost(S))}

where 8 · |S| is the uncompressed size of S in bits, CLZ(S) · |S| + storageCost(S) is the estimated size of compressed S (the second term being the size of the LZ76 dictionary for S), and |S| denotes the total byte length of the contents of S.

Page 14:

Branch-and-Bound Procedure: Modeling Compression Gain

Maximum potential compression gain (mpGain): indicates an upper bound on the achievable localGain for any partitioning strategy that uses the current grouping G as a “starting point” (i.e., one that preserves the subset placement of all containers present in G).

• mpGain is maximized for a container subset when CLZ(S) and storageCost(S) are minimized
  – this happens when the longest applicable phrase is created at every step during LZ76 parsing of S, by appending a new character to the end of the longest phrase currently in the dictionary
  – the mpGain calculation consists of building successively longer phrases, until all remaining characters in the original container set C have been processed


Page 15:

Example Gain Calculation

Assume there are two containers, C1 = {aaabc} and C2, containing an additional 5 characters.

The localGain of C1 is computed by performing an LZ76 parsing of C1:

a a a b c → Dictionary: <a> <aa> <b> <c>

CLZ(C1) = 4 phrases / 5 characters = 0.8
storageCost(C1) = 4 · (8 + log2(4)) = 40 bits
localGain(C1) = max{0, 5 · 8 − (0.8 · 5 + 40)} = 0

so C1 should be left uncompressed.

To compute mpGain(C1), we continue performing these steps until the 5 characters from C2 have been processed (nUnprocessedChars = 5):
• Select the longest phrase P in the dictionary, having length L.
• If L < nUnprocessedChars, add to the dictionary a new phrase of length L + 1 by appending a new character to P, and subtract L + 1 from nUnprocessedChars.
• Otherwise, choose an existing phrase of length nUnprocessedChars and stop.

Here a single new phrase <aaa> is added (leaving 2 unprocessed characters, covered by an existing phrase of length 2):

Updated CLZ = 5 phrases / 10 characters = 0.5
Updated storageCost = 5 · (8 + log2(5)) ≈ 51.6096 bits
mpGain ≈ 10 · 8 − (0.5 · 10 + 51.6096) ≈ 23.3904 bits
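The whole calculation can be reproduced in a few lines. This is a sketch, not the authors' code; it re-states the LZ76 parsing rule from slide 11 so the snippet is self-contained.

```python
import math

def lz76_phrases(x):
    # Each new phrase extends the longest known phrase by one character.
    seen, phrases, i = set(), [], 0
    while i < len(x):
        j = i + 1
        while j < len(x) and x[i:j] in seen:
            j += 1
        phrases.append(x[i:j])
        seen.add(x[i:j])
        i = j
    return phrases

def storage_cost(t):
    return t * (8 + math.log2(t))

def local_gain(s):
    t = len(lz76_phrases(s))
    c = t / len(s)
    return max(0.0, 8 * len(s) - (c * len(s) + storage_cost(t)))

def mp_gain(s, extra_chars):
    """Optimistically extend the parse of s over extra_chars unseen
    characters, always creating the longest possible new phrase."""
    phrases = lz76_phrases(s)
    t = len(phrases)
    longest = max(len(p) for p in phrases)
    n = len(s) + extra_chars
    remaining = extra_chars
    while remaining > 0:
        if longest < remaining:
            longest += 1      # append one character to the longest phrase
            t += 1
            remaining -= longest
        else:
            remaining = 0     # reuse an existing phrase and stop
    c = t / n
    return max(0.0, 8 * n - (c * n + storage_cost(t)))

print(local_gain("aaabc"))            # 0.0
print(round(mp_gain("aaabc", 5), 4))  # 23.3904
```

The printed values match the slide: localGain(C1) = 0 and mpGain(C1) ≈ 23.3904 bits.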

Page 16:

Branch-and-Bound Procedure: Search Tree

• Each tree node corresponds to a container grouping
• Each node stores local compression gain (localGain) and maximum potential compression gain (mpGain) values for its container grouping
• Nodes at depth i in the tree represent all possibilities for assigning container i to a container grouping
• Crucial to avoid enumerating the entire search tree…

[Figure: a search tree rooted at {C1}, with children {C1},{C2} and {C1,C2}.]

Bounding criterion: kill each subtree rooted at a node n for which

mpGain(n) < optGain − δ

where optGain is the best local gain witnessed so far and δ is a threshold value.

Remaining nodes at the bottom level of the search tree represent the set of candidate partitioning strategies.
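A compact sketch of this search is below. This is hypothetical code, not the authors' implementation: the gain and bound computations are passed in as functions so the tree enumeration and pruning logic stand alone.

```python
def branch_and_bound(containers, local_gain, mp_gain, delta):
    """Depth-first enumeration of container groupings with pruning.
    At depth i, container i either joins an existing subset or opens a
    new one; a subtree dies when its mpGain falls more than delta below
    the best localGain witnessed so far."""
    best = [float("-inf")]        # best localGain seen so far
    leaves = []

    def expand(i, grouping):
        if i == len(containers):
            g = local_gain(grouping)
            best[0] = max(best[0], g)
            leaves.append((grouping, g))
            return
        # children: join each existing subset, or open a new singleton
        children = [
            [s | {containers[i]} if j == k else s
             for j, s in enumerate(grouping)]
            for k in range(len(grouping))
        ]
        children.append(grouping + [{containers[i]}])
        children.sort(key=mp_gain, reverse=True)   # most promising first
        for child in children:
            if mp_gain(child) >= best[0] - delta:  # bounding criterion
                expand(i + 1, child)

    expand(1, [{containers[0]}])
    # surviving bottom-level groupings within delta of the best gain
    return [g for g, gain in leaves if gain >= best[0] - delta]

# Toy run with a loose bound: nothing is pruned, so all 5 partitions
# of three containers survive.
parts = branch_and_bound(["C1", "C2", "C3"],
                         lambda g: float(len(g)),   # toy gain function
                         lambda g: 100.0,           # loose upper bound
                         delta=100.0)
```

With tighter bound functions (such as the mpGain estimate from the previous slides) and a small delta, most of the partition lattice is never visited.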

Page 17:

Example: C = {C1, C2, C3}, where C1 = {aaabcaaabcaaabcabcab}, C2 = {15720653197608243849}, C3 = {abcababcbaaaabcabcab}, δ = 15.0 bits

Search tree (localGain / mpGain per node):
• {C1}: localGain = 50.4707, mpGain = 300.6970
  – {C1},{C2}: localGain = 50.4707, mpGain = 168.9804 (highest mpGain, so we explore it first)
    • {C1},{C2},{C3}: localGain = 100.9413, mpGain = 100.9413 (killed: mpGain < 126.3966 − 15.0 = 111.3966)
    • {C1,C3},{C2}: localGain = 126.3966, mpGain = 126.3966
    • {C1},{C2,C3}: localGain = 50.4707, mpGain = 50.4707 (killed: mpGain < 111.3966)
  – {C1,C2}: localGain = 0, mpGain = 108.6180 (mpGain < 111.3966, so killed without exploring this subtree)

{C1,C3},{C2} is the only candidate partitioning strategy.

Page 18:

Determining an Optimal Compression Configuration

For each candidate partitioning strategy G returned by the branch-and-bound procedure:
1. For each container subset S in G, assign the compression algorithm that achieves the best compression gain while obeying the specified time bounds on compression/decompression.
2. Compute the overall compression gain for G as the sum of the gains for each S in G.

Choose the G with the highest overall compression gain, together with the corresponding algorithm assignment, as the compression configuration.
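Step 1 can be sketched with off-the-shelf codecs standing in for the algorithm set: zlib and bz2 below are illustrative stand-ins (not the algorithms from the talk), and the time bounds are hypothetical.

```python
import bz2
import time
import zlib

# Illustrative stand-ins for the algorithm set A.
ALGORITHMS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}

def assign_algorithm(subset_bytes, t_comp, t_decomp):
    """Pick the algorithm with the best storage gain whose measured
    compression/decompression times respect the bounds (seconds)."""
    best_name, best_gain = None, 0.0
    for name, (compress, decompress) in ALGORITHMS.items():
        start = time.perf_counter()
        compressed = compress(subset_bytes)
        comp_time = time.perf_counter() - start
        start = time.perf_counter()
        decompress(compressed)
        decomp_time = time.perf_counter() - start
        gain = len(subset_bytes) - len(compressed)
        if comp_time <= t_comp and decomp_time <= t_decomp and gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain    # None: leave the subset uncompressed

subset = b"A New Career in a New Town " * 200
name, gain = assign_algorithm(subset, t_comp=1.0, t_decomp=1.0)
```

Running this per subset of each candidate G, then summing the gains, yields the overall comparison described in step 2.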


Page 19:

Early Results

Dataset       Original     # of Containers/     Default Config/   Suggested Config/   Explored   Total
              Size (B)     Search Tree Height   XMill + zlib      XMill + zlib        Nodes      Nodes
Baseball      671,924      41                   35,554            35,001              2134       2.35 x 10^36
DBLP          97,190       28                   17,277            15,217              1383       6.16 x 10^21
Weblog        2,648,181    10                   74,450            74,421              199        115,975
Lineitem      32,295,475   18                   1,513,793         1,592,386           74         6.82 x 10^11
Shakespeare   7,894,983    39                   1,990,855         1,986,160           2115       7.46 x 10^32

δ = 120 bits
Additional bounding criterion used: only the nodes having the top 60 mpGain scores are explored at each tree level.
Test system: Intel Core 2 Duo 3.16 GHz, 4 GB RAM, Ubuntu 9.04 Desktop

Branch-and-bound time (m:s):
Baseball      0:10.441
DBLP          0:07.116
Weblog        0:10.893
Lineitem      3:54.483
Shakespeare   18:47.718

Page 20:

Future Work

• Conduct more experiments involving the proposed approximation algorithm in concert with various permutation-based XML compressors (e.g., AXECHOP, XMill)
• Seek improvements to the branch-and-bound procedure
  – Starting off with a “sprint” phase that greedily searches for the best localGain would allow large portions of the search tree to be killed at an earlier stage
  – Adding a parameter that explores only the top k nodes per tree level can greatly reduce memory and time costs
  – Improve the efficiency of the existing implementation!
• Investigate alternative approximation algorithms for selecting a partitioning strategy


Page 21:

Final Slide

• Thank you
• Questions?
