SOFTWARE: PRACTICE AND EXPERIENCE
Softw. Pract. Exper. 2016; 46:1201–1217
Published online 11 November 2015 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/spe.2377

Improving a lightweight LZ77 computation algorithm for running faster

Wei Jun Liu1,2, Ge Nong1,3,*,†, Wai Hong Chan4,*,† and Yi Wu1

1Computer Science Department, Sun Yat-sen University, Guangzhou, China
2School of Physics and Electronic Information, Gannan Normal University, Ganzhou, China
3SYSU-CMU Shunde International Joint Research Institute, Shunde, China
4Department of Mathematics and Information Technology, The Hong Kong Institute of Education, Hong Kong
SUMMARY
Computing the Lempel–Ziv factorization (LZ77) of a string is a key step in many applications. However, at the same time, it constitutes a bottleneck of the entire computation. The investigation of time- and space-efficient computation of the LZ77 has become an important topic. In this paper, we present a lightweight linear-time algorithm called LZone for computing the LZ77, which is designed by improving the existing linear-time, space-efficient LZ77 algorithm BGone for speed. For an input string T[1..n] over a constant alphabet of size O(1), LZone requires only n words of workspace in addition to the input string and the output factorization, with ⌈log n⌉ bits per word. This is the same space requirement as that of the algorithm BGone. LZone has two versions, LZoneT and LZoneSA, corresponding to BGoneT and BGoneSA, respectively. Our experimental results show that for computing the LZ77 from an input string T, LZoneT and LZoneSA run around 26% and 57% faster, respectively, than their counterparts in BGone. Moreover, for computing the LZ77 from the suffix array of T, the speed of LZoneSA is on average twice that of BGoneSA. Copyright © 2015 John Wiley & Sons, Ltd.

Received 23 December 2014; Revised 4 August 2015; Accepted 19 October 2015

KEY WORDS: Lempel–Ziv factorization; algorithm; linear time; lightweight; data compression; suffix array
1. INTRODUCTION
The Lempel–Ziv factorization (LZ77) [1], named after its authors Abraham Lempel and Jacob Ziv, is an important concept in computer science. Because of its high efficiency in string processing, it has been widely used in many applications such as file compression [2–6], pattern discovery, sequence alignments, and full-text indexes [7, 8]. Very recently, several powerful instances have testified that compression schemes based on LZ77 are effective on modern datasets, especially on highly repetitive collections such as genome sequences [9, 10]. However, in all those applications, computing LZ77 has been shown to be a time and space bottleneck in practice [11].
Unless otherwise specified, in this paper, the input string is assumed to have a constant alphabet of size O(1); this is commonly satisfied by realistic data such as a text or bioinformatics database. Although a variety of worst-case linear-time algorithms have appeared over the years [12, 13], much research is still being carried out on making the computation of LZ77 more time and space efficient. Until now, the fastest linear-time algorithms for LZ77 are KKP1, KKP2, and KKP3 [11], proposed by Kärkkäinen et al. in 2013. Among the three algorithms, KKP3 is the fastest; it utilizes three size-n integer arrays, namely the suffix array SA and the two arrays PSV and NSV, which store the previous smaller values (PSVs) and next smaller values (NSVs) of the input string T, respectively. Given the SA of an input string T, KKP3 first computes the PSV and NSV simultaneously and then the LZ77. KKP2 uses one less integer array by computing only the NSV in the preliminary step and then computing the PSV on-the-fly in the parsing step, making use of the relationships between PSV, NSV, and the array Φ (first defined in [14]). In KKP1, SA is stored on the disk and streamed from the disk when computing the NSV. Apart from that, there are no other differences between KKP1 and KKP2. Although KKP1 holds only one single integer array in the memory, the total space requirement of KKP1 is still two integer arrays.

*Correspondence to: Ge Nong, Computer Science Department, Sun Yat-sen University, Guangzhou, China; Wai Hong Chan, Department of Mathematics and Information Technology, The Hong Kong Institute of Education, Hong Kong.
†E-mail: [email protected]; [email protected]

Currently, many algorithms for computing LZ77 (including KKP1, KKP2, and KKP3) are based on the suffix array and need a workspace (excluding the space for storing the input and the output) of no less than two integer arrays. That is, the total workspace is at least 2n words with ⌈log n⌉ bits per word. Recently, a space-economical and linear-time LZ77 factorization algorithm called BGone [15, 16], proposed by Goto and Bannai, uses only one integer array as the workspace, that is, the workspace is n words. This is achieved by constructing the array Φ by simulating the sorting process of SACA-K [17], followed by the in-place computation of NSV from Φ. BGone has two versions called BGoneT and BGoneSA; the former computes the LZ77 directly from T, while the latter computes the LZ77 from the suffix array of T. Provided that the suffix array has already been obtained in some applications, BGoneSA can be employed to save runtime.
While BGone brings a lightweight solution for the LZ77 computation, the experiments in [15, 16] show that its speed is quite slow, and techniques for acceleration are desired. Motivated by this, we present here an algorithm called LZone to compute LZ77 in O(n) time and n-word workspace. LZone is designed by improving BGone for speed; it has the same theoretical time and space complexities as BGone, but it can run much faster in practice. We achieve this by exploiting more properties of the suffix array and a series of techniques for rewriting the different auxiliary integer arrays from one to another. LZone also has two versions, LZoneT and LZoneSA, which correspond to BGoneT and BGoneSA, respectively. Our experimental results in Section 5 show that, with the same space requirements, LZoneT and LZoneSA are around 26% and 57% faster than BGoneT and BGoneSA, respectively. We also compared the speed of LZone with that of KKP: LZone is slower than KKP, but it uses only half the total space. Thus, LZone may be a better choice when the total space (including both RAM and disk) is an important factor to consider. For example, WRT-LZ77 [18] and XML-WRT [19] are two text preprocessors that can boost the speed and compression performance of gzip; each uses a space-consuming dictionary to keep a large collection of words. In this case, LZone can be applied for LZ77 computation to save space for the dictionary.
In summary, the contribution of our work consists of two components:

• We design an algorithm LZone for computing the LZ77 of a size-n input string over a constant alphabet. Being linear-time and using only one size-n integer array, LZone provides a new choice for space-succinct applications. LZone uses a new method to compute Φ_ℓ or Ψ_s, which makes it run much faster than BGone. This method is explained in detail to help readers see the differences between LZone and BGone.
• An experimental study is conducted to evaluate the performance of our algorithms against the others, including both linear-time and non-linear-time algorithms, such as KKP, BGone/BGtwo, LZscan, and ISA6s. Moreover, we also include the time results of gzip for comparison, to give a rough idea of how far our algorithm is from the LZ77 computation engine in the popular compression software gzip, which uses both LZ77 and Huffman coding. It is believed that gzip has been fine-tuned for speed, although its space consumption may not be optimal.

Because LZoneSA is only a variant of LZoneT under the assumption that the suffix array is already given, in the rest of this paper, we shall present the details of LZoneT, which uses only one single integer array as workspace for computing the LZ77 directly from the input string.
2. PRELIMINARIES
Let Σ be a constant alphabet of size O(1), and T = T[1..n] = T[1]T[2]...T[n] be a string of n characters from Σ, where T[i] is the i-th character of T. The length of a string X is denoted by |X|, for example, |T| = n. T[i..j] with 1 ≤ i ≤ j ≤ n is a substring of T consisting of the i-th to the j-th characters, that is, T[i..j] = T[i]T[i + 1]...T[j]. For convenience, we assume that the last character of T is the sentinel $, that is, T[n] = $, which is the unique lexicographically smallest character in T. The size of the alphabet of an input string T is denoted by σ or |Σ|.

2.1. Suffix and suffix array
For i = 1, ..., n, the suffix of T starting with the i-th character of T is denoted by T[i..n] or suf(T, i), and the prefix of T ending at the i-th character of T is denoted by T[1..i] or pre(T, i). The length of the longest common prefix (LCP) of suf(T, i) and suf(T, j) is denoted by lcp(i, j), e.g., for T = aabbaabb, lcp(1, 2) = 1 = |a|, and lcp(1, 5) = 4 = |aabb|.

A suffix suf(T, i) is S-type if T[i] < T[i + 1], or T[i] = T[i + 1] and suf(T, i + 1) is S-type. A suffix suf(T, i) is L-type if T[i] > T[i + 1], or T[i] = T[i + 1] and suf(T, i + 1) is L-type. The last suffix suf(T, n) is always S-type. A suffix suf(T, i) is leftmost S-type (LMS) if suf(T, i) is S-type and suf(T, i − 1) is L-type. Symmetrically, a suffix suf(T, i) is leftmost L-type (LML) if suf(T, i) is L-type and suf(T, i − 1) is S-type. The last suffix suf(T, n) is always LMS, while the first suffix suf(T, 1) is neither LMS nor LML. An L-type suffix is also called an L-suffix for short, and likewise for S-type, LMS, and LML suffixes. The type of a character T[i] is defined as the type of suf(T, i).

The suffix array of an input string T is an array of length n, denoted by SA, indicating the lexicographical order of all the suffixes of T. That is, for any 1 ≤ i < j ≤ n, suf(T, SA[i]) < suf(T, SA[j]), or T[SA[i]..n] < T[SA[j]..n]. The inverse array of SA, denoted by ISA, is also an array of length n, which is the inverse permutation of SA such that ISA[SA[i]] = i. ISA[i] = j means that suf(T, i) is at the j-th position in SA. The arrays storing all the sorted L-type, S-type, LMS, and LML suffixes are denoted by SA_ℓ, SA_s, SA_lms, and SA_lml, respectively.

For example, given T = mmiissiiss$, Table I shows the type of each character in string T (marked by 'L', 'S', and '*'), the arrays SA, ISA and so on, where the arrays SA_ℓ, SA_s, Φ, and Ψ will be introduced in Section 2.2.
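
To make the definitions of this section concrete, the following Python sketch classifies the suffixes and builds SA and ISA for the example string. It is illustrative only: the function names are hypothetical, positions are 1-based as in the text (the returned lists hold the entries for positions 1..n), and the suffix array is obtained by naively sorting the suffixes rather than by a linear-time algorithm such as SACA-K.

```python
def suffix_types(T):
    """Return a string whose (i-1)-th character is the type of suf(T, i)."""
    n = len(T)
    types = ['S'] * n                    # suf(T, n) is always S-type
    for i in range(n - 2, -1, -1):       # right-to-left scan (0-based i)
        if T[i] > T[i + 1]:
            types[i] = 'L'
        elif T[i] == T[i + 1]:
            types[i] = types[i + 1]      # equal characters inherit the type
    return ''.join(types)


def leftmost(types, kind):
    """Positions i with type `kind` whose left neighbour has the other type."""
    return [i for i in range(2, len(types) + 1)
            if types[i - 1] == kind and types[i - 2] != kind]


def suffix_array(T):
    """Naive suffix array: SA[1..n] returned as a Python list."""
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])


def inverse_suffix_array(SA):
    """ISA[1..n] returned as a Python list, with ISA[SA[r]] = r."""
    ISA = [0] * len(SA)
    for rank, pos in enumerate(SA, start=1):
        ISA[pos - 1] = rank
    return ISA


T = "mmiissiiss$"
types = suffix_types(T)
SA = suffix_array(T)
print(types, leftmost(types, 'S'), leftmost(types, 'L'))
print(SA, inverse_suffix_array(SA))
```

On T = mmiissiiss$ the printed values can be compared directly with the rows L/S, LMS, LML, SA, and ISA of Table I.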
2.2. Φ and Ψ
We define Φ[SA[1]] = 0, Φ[0] = SA[n], and Φ[i] = SA[ISA[i] − 1] for i ∈ [1, n] \ {SA[1]}, which makes Φ a cycle of length |T| + 1 indicating the predecessor of each suffix of T in the lexicographical order. Similarly, we define Ψ[SA[n]] = 0, Ψ[0] = SA[1], and Ψ[i] = SA[ISA[i] + 1] for i ∈ [1, n] \ {SA[n]}, which makes Ψ a cycle of length |T| + 1 indicating the successor of each suffix of T in the lexicographical order. Clearly, Ψ is the inverse of Φ. Given the array Φ, starting from Φ[0], we can visit all of the suffixes of T from the largest to the smallest. Analogously, given the array Ψ, starting from Ψ[0], we can visit all of the suffixes from the smallest to the largest. So, the Φ array can be viewed as an array-based linked list, which is another way to store SA, and so can the Ψ array.

Table I. An example for LMS, LML, SA, ISA, SA_ℓ, SA_s, Φ, Ψ and so on.

Index      0   1   2   3   4   5   6   7   8   9  10  11
T              m   m   i   i   s   s   i   i   s   s   $
L/S            L   L   S   S   L   L   S   S   L   L   S
LMS                    *               *               *
LML                            *               *
SA            11   7   3   8   4   2   1  10   6   9   5
ISA            7   6   3   5  11   9   2   4  10   8   1
SA_s          11   7   3   8   4
SA_ℓ           2   1  10   6   9   5
SA_lms        11   7   3
SA_lml         9   5
Φ          5   2   4   7   8   9  10  11   3   6   1   0
Ψ         11  10   1   8   2   0   9   3   4   5   6   7

LMS, leftmost S-type; LML, leftmost L-type.

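
As a small illustration of these definitions, the sketch below derives Φ and Ψ from a given SA and walks the resulting linked lists. The helper names phi_psi and walk are hypothetical; the arrays have n + 1 entries so that index 0 can serve as the list head, as in the text.

```python
def phi_psi(SA):
    """Build Phi and Psi (indices 0..n) from SA given as a list of SA[1..n]."""
    n = len(SA)
    phi = [0] * (n + 1)
    psi = [0] * (n + 1)
    phi[0] = SA[-1]                  # head of Phi: lexicographically largest suffix
    psi[0] = SA[0]                   # head of Psi: lexicographically smallest suffix
    for r in range(n):
        phi[SA[r]] = SA[r - 1] if r > 0 else 0       # predecessor, 0 for SA[1]
        psi[SA[r]] = SA[r + 1] if r + 1 < n else 0   # successor, 0 for SA[n]
    return phi, psi


def walk(link):
    """Follow an array-based linked list from index 0 until it returns to 0."""
    out, i = [], link[0]
    while i != 0:
        out.append(i)
        i = link[i]
    return out


SA = [11, 7, 3, 8, 4, 2, 1, 10, 6, 9, 5]     # SA of mmiissiiss$ (Table I)
phi, psi = phi_psi(SA)
print(phi)
print(psi)
print(walk(psi) == SA, walk(phi) == SA[::-1])
```

Walking Ψ from index 0 reproduces SA, and walking Φ reproduces SA in reverse, which is the sense in which each array is another way to store SA.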
Note that Φ and Ψ are defined using SA. Similarly, we define Φ_ℓ, Ψ_s, and Ψ_lms using SA_ℓ, SA_s, and SA_lms, respectively, as follows. (i) Φ_ℓ[SA_ℓ[1]] = 0, Φ_ℓ[0] = SA_ℓ[|SA_ℓ|], and Φ_ℓ[i] is the immediate lexicographical predecessor of suf(T, i) among the L-suffixes for every other L-suffix suf(T, i), which makes Φ_ℓ a cycle of length |SA_ℓ| + 1. (ii) Ψ_s[SA_s[|SA_s|]] = 0, Ψ_s[0] = SA_s[1], and Ψ_s[i] is the immediate lexicographical successor of suf(T, i) among the S-suffixes for every other S-suffix suf(T, i), which makes Ψ_s a cycle of length |SA_s| + 1. (iii) Ψ_lms[SA_lms[|SA_lms|]] = 0, Ψ_lms[0] = SA_lms[1], and Ψ_lms[i] is the immediate lexicographical successor of suf(T, i) among the LMS-suffixes for every other LMS-suffix suf(T, i), which makes Ψ_lms a cycle of length |SA_lms| + 1.

Given Φ_ℓ, all of the L-suffixes of T can be visited from the largest to the smallest. Conversely, given Ψ_lms or Ψ_s, all of the LMS-suffixes or S-suffixes, respectively, can be visited from the smallest to the largest.

2.3. Lempel–Ziv factorization
The LZ77 introduces the concept of the longest previous factor. The pair (p_i, l_i) is the longest previous factor of position i in T, such that, for any 1 ≤ i ≤ n, T[p_i..p_i + l_i − 1] = T[i..i + l_i − 1], where p_i < i and l_i ≥ 0 is maximized. That is, T[i..i + l_i − 1] is the longest prefix of suf(T, i) that occurs at least once before i. If T[i] does not occur before i, then p_i = T[i] and l_i = 0.

The formation of the LZ77 of a string T is a left-to-right, greedy process that parses the string T into the longest previous factors. In each parsing phrase i (starting at some position j of T), an ordered pair is acquired. Then, the next phrase starts at position j + l_i if l_i > 0, or at position j + 1 otherwise.

For example, given T = mmiissiiss, the LZ77 of T is as follows:

(m, 0), (1, 1), (i, 0), (3, 1), (s, 0), (5, 1), and (3, 4).
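
The greedy parsing can be checked with a direct, quadratic-time sketch that follows the definition literally. This is only a reference implementation for small inputs, not any of the algorithms discussed in this paper; the function name is hypothetical, and when several earlier positions give the same maximal length, the leftmost one is reported (an assumption, since the definition does not fix the choice).

```python
def lz77_naive(T):
    """Return the LZ77 factors of T as (p, l) pairs, or (char, 0) for new characters."""
    n, j, factors = len(T), 1, []          # j is the 1-based start of the next phrase
    while j <= n:
        best_p, best_l = 0, 0
        for p in range(1, j):              # every candidate earlier start p < j
            l = 0
            while j + l <= n and T[p + l - 1] == T[j + l - 1]:
                l += 1                     # the match may overlap position j
            if l > best_l:                 # strict '>' keeps the leftmost best p
                best_p, best_l = p, l
        if best_l == 0:
            factors.append((T[j - 1], 0))  # character not seen before
            j += 1
        else:
            factors.append((best_p, best_l))
            j += best_l
    return factors


print(lz77_naive("mmiissiiss"))
```

For T = mmiissiiss this prints the seven factors listed above.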
2.4. Next and previous smaller values
Crochemore and Ilie [20] showed that p_i can be computed by the NSVs/PSVs, which are defined as

PSV[SA[i]] = SA[j_1] and NSV[SA[i]] = SA[j_2],

where j_1 = max{j ∈ [1, i) | SA[j] < SA[i]} and j_2 = min{j ∈ (i, n] | SA[j] < SA[i]}. If j_1 (or j_2) does not exist, we set j_1 (or j_2) equal to 0.
2.5. Lazy LZ factorization
The lazy LZ factorization is actually the lazy evaluation of LCP values. For example, lcp(i, NSV[i]) and lcp(i, PSV[i]) will not be computed until i is a starting position of a phrase. This trick is used in the recent fast LZ factorization algorithms [16, 21, 22] and the currently fastest LZ factorization algorithm KKP3 [11]. During the process of computing the LCP values, the characters of two suffixes are compared one by one, and the total time complexity is O(n). Our new algorithm presented in this paper also utilizes the lazy LZ factorization.

In the rest of the paper, we assume that all LZ factors are sequentially acquired from left to right, and the space for storing the LZ factors is excluded from the workspace.

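
A minimal sketch of the lazy evaluation is given below. It assumes that PSV and NSV are available as arrays indexed by text position, with 0 meaning "undefined"; here they are produced by a naive suffix sort and a simple stack scan, both of which are stand-ins chosen for brevity rather than the procedures of the cited algorithms. All function names are hypothetical.

```python
def lcp(T, i, j):
    """Length of the longest common prefix of suf(T, i) and suf(T, j); 0 if j == 0."""
    if j == 0:
        return 0
    n, l = len(T), 0
    while max(i, j) + l <= n and T[i + l - 1] == T[j + l - 1]:
        l += 1
    return l


def lz_lazy(T):
    n = len(T)
    SA = sorted(range(1, n + 1), key=lambda i: T[i - 1:])   # naive suffix array
    PSV, NSV, stack = [0] * (n + 1), [0] * (n + 1), []
    for pos in SA:                                          # nearest smaller values
        while stack and stack[-1] > pos:
            NSV[stack.pop()] = pos
        PSV[pos] = stack[-1] if stack else 0
        stack.append(pos)
    factors, i = [], 1
    while i <= n:
        # LCP values are evaluated only here, at the start of a phrase.
        lp, ln = lcp(T, i, PSV[i]), lcp(T, i, NSV[i])
        p, l = (PSV[i], lp) if lp >= ln else (NSV[i], ln)
        if l == 0:
            factors.append((T[i - 1], 0))
            i += 1
        else:
            factors.append((p, l))
            i += l                                          # skip the whole phrase
    return factors


print(lz_lazy("mmiissiiss"))
```

Because each phrase start costs at most two LCP computations whose lengths are bounded by the phrase length plus one, the character comparisons in the parsing loop sum to O(n).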
3. PRIOR ARTS
3.1. KKP3, KKP2 and KKP1
The process of LZ factorization in each of KKP3, KKP2, and KKP1 is composed of two common steps, the preliminary step and the parsing step. For example, in KKP3, in the preliminary step, the PSVs and NSVs for all the positions are computed by sequentially scanning the SA of T in O(n) time, which makes use of the technique of peak elimination (originated from [20]). Then, in the parsing step, the LZ factors can be acquired by repeatedly calling the LZ-factor function in O(n) time. It is clear that KKP3 runs in O(n) time and requires three auxiliary integer arrays (SA, PSV, and NSV) of length |T|.

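
For illustration, a single left-to-right scan of SA with an explicit stack computes all PSVs and NSVs in O(n) time. This is the textbook nearest-smaller-values scan, offered as a sketch of the idea behind the preliminary step; it is not claimed to be KKP3's exact peak-elimination procedure, which avoids a separate stack. The function name is hypothetical, and 0 denotes an undefined value.

```python
def psv_nsv_scan(SA):
    """Compute PSV and NSV (indexed by text position, 0 if undefined) in one scan."""
    n = len(SA)
    PSV = [0] * (n + 1)
    NSV = [0] * (n + 1)
    stack = []                          # text positions, increasing from bottom to top
    for pos in SA:
        while stack and stack[-1] > pos:
            NSV[stack.pop()] = pos      # pos is the first smaller value to the right
        PSV[pos] = stack[-1] if stack else 0
        stack.append(pos)
    return PSV, NSV                     # entries left on the stack keep NSV = 0


PSV, NSV = psv_nsv_scan([11, 7, 3, 8, 4, 2, 1, 10, 6, 9, 5])
print(PSV[1:])
print(NSV[1:])
```

Each position is pushed and popped at most once, which gives the linear running time.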
In KKP2, X_t = {T[i..n] | i ≤ t} for t ∈ [1..n]. This means that X_t contains all suffixes of T starting at or before the t-th character. Let Φ_t be Φ restricted to X_t, that is, for t ∈ [1..n], Φ_t[i] is the immediate lexicographical predecessor of suf(T, i) among the suffixes in X_t. In order to make Φ_t a complete unicyclic permutation like Φ, Φ_t[i_min] is set to 0, where suf(T, i_min) is the lexicographically smallest suffix in X_t. Also, Φ_t[0] is set to i_max, where suf(T, i_max) is the lexicographically largest suffix in X_t. When t = n, Φ_t = Φ. Thus, in the preliminary step, only the NSVs need to be computed, because the PSVs can be computed in-place on-the-fly in the parsing step by scanning and rewriting the NSV sequentially. That is, in the parsing step, T is parsed from left to right. In the t-th round, Φ_{t−1} has already been obtained and stored in NSV[1..t − 1], and so PSV[t] can be acquired as Φ_{t−1}[NSV[t]]. Because NSV[t] does not need to be kept after it has been processed, the space for keeping NSV[t] can then be used for storing Φ_t[t] = PSV[t]. After the t-th round in the parsing step, NSV[1..t] = Φ_t and NSV[t + 1..n] remains unchanged.

As described previously, KKP2 requires two integer arrays (SA and NSV) in the first step and one integer array (NSV) in the second step. Thus, KKP2 runs in O(n) time using two auxiliary size-n integer arrays. In KKP1, SA is stored on the disk. When computing the NSVs, KKP1 streams the suffix array from the disk. Thus, there is only one integer array (i.e., NSV) that is kept in the main memory, but the total space requirement is still two integer arrays (i.e., SA and NSV).

3.2. BGone
BGone computes the LZ77 using one single integer array. We first describe BGoneT, which computes the LZ77 directly from T. Let A[1..n] be a size-n integer array; A is called the working array for BGone and is reused to store SA_lms, Φ, or NSV in different steps of BGone. In order to obtain the LZ factors, BGoneT conducts the following steps.

1. Compute SA_lms.
2. Compute Ψ_lms from SA_lms.
3. Compute Φ_ℓ from Ψ_lms.
4. Compute Φ from the result of Φ_ℓ.
5. Compute NSV from Φ.
6. Compute LZ77 from NSV.

BGoneSA, which is another version of BGone, computes the LZ77 from the SA of T. The difference between BGoneT and BGoneSA is how SA_lms is obtained. BGoneT calls SACA-K to compute SA_lms directly from the input string T, while BGoneSA obtains SA_lms by scanning the SA from left to right to find all of the LMS-suffixes.

4. OUR WORK
4.1. LZoneT
Our algorithm LZone, which has the two variations LZoneT and LZoneSA, is described in this section. The algorithmic framework of LZoneT is shown in Algorithm 1. The workspace of LZone remains identical to that of BGone, that is, the work array A only (Section 3.2). The crucial difference between BGone and LZone resides in their methods for computing Φ, that is, BGone computes Φ from SA_lms, but LZone computes Φ from SA_ℓ or Ψ from SA_s. LZoneT first counts the number of the L-suffixes and S-suffixes. If the S-suffixes outnumber the L-suffixes, LZoneT computes SA_ℓ from T; otherwise, it computes SA_s. Computing SA_lms in LZoneT is not required because of the following reasons.

1. If the number of L-suffixes is less than that of S-suffixes, all of the L-suffixes can be linked from the lexicographically largest one to the smallest by putting the L-suffixes in the S-type positions of the working array A (the i-th position in A is S-type or L-type depending on whether suf(T, i) is an S-suffix or L-suffix, respectively). Thus, we can obtain Φ_ℓ. When there are more L-suffixes than S-suffixes, we obtain Ψ_s by putting the S-suffixes in the L-type positions of A.
2. Given Ψ_s, PSV can be computed in linear time using only one integer array plus O(1) extra workspace by rewriting Ψ in-place. Once PSV is obtained, LZoneT can compute NSV from PSV on-the-fly in the parsing step, which is a symmetric process to that of KKP2 in computing PSV from NSV [11]. In a similar way, given Φ_ℓ, NSV can be computed in linear time using O(1) workspace by rewriting Φ in-place. Once NSV is obtained, PSV can be sequentially acquired in the parsing step (see KKP2n in [11]).
3. By directly calling SACA-K, we can easily obtain SA_ℓ or SA_s and then compute Φ_ℓ or Ψ_s, respectively, as in item (1). Once either Φ_ℓ or Ψ_s is obtained, we need only one scan to compute Φ or Ψ, respectively. In this way, we do not need to sort the L-suffixes the same way as BGoneT does, in which two scans are performed to obtain Φ: one for sorting the L-suffixes and the other for sorting the S-suffixes.

Figure 1 shows the difference between BGone and LZone. In LZone, depending on the numbers of L- and S-suffixes, the LZ77 factorization is computed either from Ψ or from Φ. Neither execution path of LZone needs to compute Ψ_lms, which is required by BGone, resulting in a faster speed.

Figure 1. The Lempel–Ziv factorization computation processes of
BGone and LZone.

4.2. Computing SA_s and SA_ℓ from T by induced sorting
The principle behind induced sorting is to induce the lexicographic order of the unsorted suffixes from the sorted suffixes, as observed in [23] as follows.

Lemma 1
Given that all of the L- or S-suffixes of T are sorted, all of the suffixes of T can be sorted in O(n) time.

However, as further observed in [24], not every S-suffix is useful for inducing the order of all the L-suffixes; knowing the order of all the LMS-suffixes is already enough for inducing the order of all of the suffixes.

Lemma 2
Given that all of the LMS-suffixes of T are sorted, all of the suffixes of T can be sorted in O(n) time.

Based on the aforementioned lemmas, given that all of the LMS-suffixes of T have been sorted and stored in A[1..k], SACA-K can obtain the SA of the string T by performing the following four-step procedure in O(n) time and O(1) workspace. In these steps, the range of suffixes with an identical heading character c is called a bucket in A, denoted as bucket(c). Each bucket consists of a (possibly empty) sequence of L-suffixes followed by a (possibly empty) sequence of S-suffixes. Hence, bucket(c) can be divided into at most two sub-buckets, bucket_ℓ(c) and bucket_s(c), for the L- and S-suffixes, respectively.

1. Initialize each item of A[k + 1..n] as EMPTY.
2. Scan A[1..k] from right to left, and put all of the LMS-suffixes into their buckets in A, from the end to the head of each bucket.
3. Scan A from left to right. For each non-empty position i, if suf(T, A[i] − 1) is an L-suffix, then put suf(T, A[i] − 1) in the left-most empty position of bucket(T[A[i] − 1]).
4. Scan A from right to left. For each non-empty position i, if suf(T, A[i] − 1) is an S-suffix, put suf(T, A[i] − 1) in the right-most empty position of bucket(T[A[i] − 1]).

After step (3), all of the L-suffixes have been sorted; hence, we need only to perform steps (1) through (3) to compute SA_ℓ.

Analogously, we can obtain the following result.

Lemma 3
Given that all of the LML-suffixes of T are sorted, all of the suffixes of T can be sorted in O(n) time.

Thus, if SA_lml has been obtained and stored in A[1..k], we can also obtain the SA of T by performing the following four-step procedure in O(n) time and using O(1) workspace.

1. Initialize each item of A[k + 1..n] as EMPTY.
2. Scan A[1..k] from left to right, and put all of the LML-suffixes into their buckets in A, from the head to the end of each bucket.
3. Scan A from right to left. For each non-empty position i, if suf(T, A[i] − 1) is an S-suffix, put suf(T, A[i] − 1) into the right-most empty position of bucket(T[A[i] − 1]).
4. Scan A from left to right. For each non-empty position i, if suf(T, A[i] − 1) is an L-suffix, then put suf(T, A[i] − 1) into the left-most empty position of bucket(T[A[i] − 1]).

Obviously, we need only to perform steps (1) through (3) to compute SA_s.

As described previously, both SA_s and SA_ℓ can be correctly computed in O(n) time and O(1) workspace.
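
The LMS-based four-step procedure can be sketched as follows. The sketch assumes that the sorted LMS-suffixes are already given and, for readability, keeps explicit bucket-boundary dictionaries and a separate output array; it therefore does not reproduce the O(1)-workspace bookkeeping of SACA-K. The names induce_from_lms and suffix_types are hypothetical.

```python
def suffix_types(T):
    types = ['S'] * len(T)
    for i in range(len(T) - 2, -1, -1):
        if T[i] > T[i + 1] or (T[i] == T[i + 1] and types[i + 1] == 'L'):
            types[i] = 'L'
    return types


def induce_from_lms(T, sorted_lms):
    """Return (SA, SA_l) induced from the sorted LMS-suffixes of T (1-based)."""
    n, types = len(T), suffix_types(T)
    head, tail, pos = {}, {}, 1
    for c in sorted(set(T)):                 # bucket(c) occupies A[head[c]..tail[c]]
        head[c], tail[c] = pos, pos + T.count(c) - 1
        pos = tail[c] + 1
    A = [0] * (n + 1)                        # step 1: 0 plays the role of EMPTY
    end = dict(tail)
    for s in reversed(sorted_lms):           # step 2: LMS-suffixes to bucket ends
        A[end[T[s - 1]]] = s
        end[T[s - 1]] -= 1
    front = dict(head)
    for i in range(1, n + 1):                # step 3: induce the L-suffixes
        j = A[i] - 1
        if j >= 1 and types[j - 1] == 'L':
            A[front[T[j - 1]]] = j
            front[T[j - 1]] += 1
    SA_l = [x for x in A[1:] if x and types[x - 1] == 'L']
    end = dict(tail)
    for i in range(n, 0, -1):                # step 4: induce the S-suffixes
        j = A[i] - 1
        if j >= 1 and types[j - 1] == 'S':
            A[end[T[j - 1]]] = j
            end[T[j - 1]] -= 1
    return A[1:], SA_l


SA, SA_l = induce_from_lms("mmiissiiss$", [11, 7, 3])
print(SA_l)
print(SA)
```

For T = mmiissiiss$ with SA_lms = (11, 7, 3), the L-suffixes are already in sorted order after step (3), and step (4) completes the full suffix array of Table I.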

4.3. Computing Φ_ℓ/Ψ_s from SA_ℓ/SA_s
In this section, we transform SA_ℓ/SA_s into the array-based linked list representation. Suppose that SA_ℓ/SA_s is stored in A[1..k]. Let snum and lnum be the number of S-suffixes and L-suffixes, respectively.

4.3.1. If lnum > snum, compute Ψ_s. In this case, we compute SA_s first, for there are enough L-type positions to store all of the S-suffixes. All of the suffixes in SA_s must be linked from the lexicographically smallest suffix to the largest, so that we can simulate the process in SACA-K for induced sorting the L-suffixes. For each suf(T, SA_s[i]), its lexicographically succeeding S-type suffix suf(T, SA_s[i + 1]) will be put in A[SA_s[i]], that is, A[SA_s[i]] = SA_s[i + 1], A[0] = SA_s[1], and A[SA_s[k]] = 0. Ψ_s is then obtained, in which A[0] is the lexicographically smallest suffix in SA_s. This can be performed by the following steps.

1. Reverse the order of the sorted S-suffixes in SA_s so that all of the S-suffixes are stored in descending order in A[1..k].
2. Put all of the S-suffixes in L-type positions in A. Let p and q point to A[k] and A[n], respectively, that is, p = k and q = n. Then, scan T from right to left, decreasing q by one at each scanned position. If suf(T, q) is L-type (i.e., the q-th position is L-type), we put suf(T, A[p]) in A[q], i.e., A[q] = A[p] and A[p] = EMPTY, and then decrease p by one. In this way, all of the S-suffixes can be put in L-type positions. To determine the type of suf(T, q), we use an integer variable type_pre to record the type of the previously scanned position: 1 for S-type and 0 for L-type. Because suf(T, n) is always an S-suffix, the n-th position of A is always S-type. So, the variable type_pre is initialized as 1, and we scan T from T[n − 1]. Given the value of type_pre, when processing suf(T, q), the type of the q-th position can be immediately obtained by comparing T[q] and T[q + 1]. Therefore, this step runs in O(n) time using O(1) workspace.
3. Link up all of the S-suffixes to obtain Ψ_s. Because all of the S-suffixes are stored in the L-type positions of A, all of the S-type positions are empty. We scan T from right to left again to find all non-empty L-type positions in A. Suppose that the j_1-th and j_2-th are such positions, and position j_2 is the one next to j_1. Put p = A[j_1], q = A[j_2], A[p] = q and then A[j_1] = EMPTY. Because position p of A must be empty, the previous operations can be performed without losing any values in SA_s. In this way, for each suf(T, i) in SA_s, we can put the succeeding suffix in A[i], and this process needs to scan A only once. Therefore, this step runs in O(n) time using O(1) workspace.

For example, given T = mmiissiiss$, Table II shows the result of each step for computing Ψ_s from SA_s. In this table, row 'L/S' gives the type of each character. Because the number of S-type characters is smaller, SA_s is computed, and the result of each step for computing Ψ_s from SA_s is shown in rows '(1)', '(2)', and '(3)'.

Table II. An example for computing Ψ_s from SA_s.

Index      0   1   2   3   4   5   6   7   8   9  10  11
T              m   m   i   i   s   s   i   i   s   s   $
L/S            L   L   S   S   L   L   S   S   L   L   S
SA_s          11   7   3   8   4
(1)            4   8   3   7  11
(2)                4           8   3           7  11
(3)       11           8   0           3   4           7
Ψ_s       11           8   0           3   4           7

4.3.2. If lnum ≤ snum, compute Φ_ℓ. We compute SA_ℓ first. All of the suffixes in SA_ℓ should be linked from the lexicographically largest suffix to the smallest for induced sorting the S-suffixes. That is, we turn all of the L-suffixes into an array Φ_ℓ so that A[0] stores the lexicographically largest suffix in SA_ℓ. For each suf(T, SA_ℓ[i]), its lexicographically preceding L-type suffix suf(T, SA_ℓ[i − 1]) will be put in A[SA_ℓ[i]], i.e., A[SA_ℓ[i]] = SA_ℓ[i − 1]. Thus, Φ_ℓ can be obtained. This can be carried out by the following steps.

1. Put all of the L-suffixes in the S-type positions of A. This step is similar to putting all of the S-suffixes in the L-type positions of A in Section 4.3.1. Note that it does not need to reverse the sorted L-suffixes in SA_ℓ.
2. Link up all of the L-suffixes to obtain Φ_ℓ. Because all of the L-suffixes are stored in the S-type positions of A, all L-type positions are empty. We scan T from right to left to find all of the non-empty S-type positions of A. Suppose positions j_1 and j_2 are such positions, and position j_2 is the next one to j_1. Put p = A[j_1], q = A[j_2], A[p] = q and then A[j_1] = EMPTY. In this way, for each suf(T, i) in SA_ℓ, we can put the preceding suffix in A[i]. The process scans T only once. Therefore, this step also runs in O(n) time and O(1) workspace.

For T = eeiissiiss$ as an example, Table III shows the result of each step for computing Φ_ℓ from SA_ℓ. In this table, because the number of L-type characters is smaller, SA_ℓ is computed, and the result of each step for computing Φ_ℓ from SA_ℓ is shown in rows '(1)' and '(2)'.

Table III. An example for computing Φ_ℓ from SA_ℓ.

Index      0   1   2   3   4   5   6   7   8   9  10  11
T              e   e   i   i   s   s   i   i   s   s   $
L/S            S   S   S   S   L   L   S   S   L   L   S
SA_ℓ          10   6   9   5
(1)       10, 6, 9, 5 stored, in this left-to-right order, in S-type positions of A
(2)        5                   9  10           6   0
Φ_ℓ        5                   9  10           6   0

4.4. Computing Ψ/Φ from Ψ_s/Φ_ℓ using O(1) workspace
Firstly, we consider the situation in which the L-suffixes outnumber the S-suffixes, that is, lnum > snum. In this situation, we should compute Ψ from Ψ_s. Note that all of the S-suffixes are stored in A, Ψ_s is an array-based singly linked list, and Ψ_s[0] is the lexicographically smallest suffix in the list. Based on Lemma 1 in Section 4.2, we can simulate the process for sorting the L-suffixes by accessing A. Thus, Ψ will be obtained finally.

To accomplish this task, we view this process as scanning all of the buckets in A in ascending order and simulate the method used by BGoneT to sort all of the S-suffixes from Φ_ℓ. For a character c and its bucket bucket(c), we scan bucket_ℓ(c) first, then bucket_s(c). Four integer arrays, Lbkt_s[c], Lbkt_e[c], Sbkt_s[c], and Sbkt_e[c], each of size |Σ|, are required. We use Lbkt_s[c] and Lbkt_e[c] to store the lexicographically smallest and largest suffixes, respectively, in bucket_ℓ(c), and use Sbkt_s[c] and Sbkt_e[c] to store the lexicographically smallest and largest suffixes, respectively, in bucket_s(c).

The following are the concrete steps for sorting the L-suffixes in LZoneT with the given Ψ_s.

1. Initialize each item of Lbkt_s[c], Lbkt_e[c], Sbkt_s[c], and Sbkt_e[c] as EMPTY.
2. Scan Ψ_s once to compute Sbkt_s[c] and Sbkt_e[c] for each bucket(c).
3. Scan all of the buckets in lexicographically ascending order to sort and store all of the L-suffixes in their buckets in A. For each bucket(c), we scan from Lbkt_s[c] to Lbkt_e[c], then from Sbkt_s[c] to Sbkt_e[c].

After step (2), we can obtain the start and end suffixes for each bucket_s(c). Note that all of the suffixes in bucket_s(c) have already been linked together from the smallest to the largest before step (1).

In step (3), the values of Lbkt_s[c] and Lbkt_e[c] are updated dynamically. Suppose that we are processing suf(T, i). Let j = i − 1. If suf(T, j) is an L-suffix, we put it in bucket(T[j]); otherwise, we do nothing. Because each bucket is split into bucket_ℓ(c) and bucket_s(c), the type of suf(T, i − 1) can be determined in constant time by comparing T[i − 1] and T[i] once we know which sub-bucket is being scanned.
For example, suf(T, i − 1) is L-type if T[i − 1] ≥ T[i] when scanning bucket_ℓ(c), and suf(T, i − 1) is L-type if T[i − 1] > T[i] when scanning bucket_s(c). Then, we can put the L-type suf(T, j) in bucket(T[j]) according to the following steps.

(1) Check whether Lbkt_s[T[j]] is empty or not.
(2) If Lbkt_s[T[j]] is empty, suf(T, j) is the lexicographically smallest suffix in bucket_ℓ(c) for c = T[j], that is, the starting suffix of bucket(c). We set Lbkt_s[T[j]] = Lbkt_e[T[j]] = j.
(3) If Lbkt_s[T[j]] is non-empty, the end suffix of bucket_ℓ(T[j]) is smaller than suf(T, j). We can put suf(T, j) in A by setting A[Lbkt_e[T[j]]] = j and update the end of bucket_ℓ(T[j]) by setting Lbkt_e[T[j]] = j.

In this way, each of the L-suffixes can be put in its corresponding position in bucket_ℓ(c) of A. All of the suffixes in the same bucket_ℓ(c) can be linked together from the lexicographically smallest to the largest. Algorithm 2 shows the pseudocode for computing Ψ from Ψ_s, and an example is given in Table IV for illustrating this algorithm on T = mmiissiiss$.
In this example, the work array A is reused to store the input Ψ_s, the output Ψ and all the temporary data generated during the computation process. Initially, Ψ_s is sparsely stored in the S-type positions of A. The rows between Ψ_s and Ψ give the results of scanning all the suffixes in increasing
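Table IV. An example for computing Ψ from Ψ_s.
A dash denotes an empty entry. The two rows with a dash in the first column show A just before the next bucket is scanned.

Index    0   1   2   3   4   5   6   7   8   9  10  11
T        -   m   m   i   i   s   s   i   i   s   s   $
L/S      -   L   L   S   S   L   L   S   S   L   L   S
Ψ_s     11   -   -   8   0   -   -   3   4   -   -   7
11      11   -   -   8   0   -   -   3   4   -   -   7
7       11   -   -   8   0   -   -   3   4   -   6   7
3       11   -   -   8   0   -   -   3   4   -   6   7
8       11   -   -   8   0   -   -   3   4   -   6   7
4       11   -   -   8   0   -   -   3   4   -   6   7
-       11   -   -   8   2   -   -   3   4   -   6   7
2       11   -   1   8   2   -   -   3   4   -   6   7
1       11   -   1   8   2   -   -   3   4   -   6   7
-       11  10   1   8   2   -   -   3   4   -   6   7
10      11  10   1   8   2   -   9   3   4   -   6   7
6       11  10   1   8   2   -   9   3   4   5   6   7
9       11  10   1   8   2   -   9   3   4   5   6   7
5       11  10   1   8   2   0   9   3   4   5   6   7
Ψ       11  10   1   8   2   0   9   3   4   5   6   7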
order, where each row gives the status of A after scanning the suffix indexed by the first column of this row; for example, the row '11' is the status of A after scanning suf(T, 11). Notice that the suffixes in the first column between Ψ_s and Ψ are lexicographically increasing, which is the same order as that in the SA of T. There are four buckets in total to be scanned, that is, bucket($), bucket(i), bucket(m) and bucket(s). From Ψ_s, we obtain that Sbkts[$] = Sbkte[$] = 11, Sbkts[i] = 7 and Sbkte[i] = 4. The arrays Lbkts and Lbkte are initialized as empty.
First, the smallest bucket, bucket($), is scanned. There is only one suffix belonging to this bucket, that is, suf(T, 11), which is stored in A[0]. The preceding suffix of suf(T, 11) is suf(T, 10). Given T[10] = s being L-type, suf(T, 10) is put in bucket(s). Because both Lbkts[s] and Lbkte[s] are currently empty, and the previous suffix of suf(T, 10) in SA is currently unknown, we will not save suf(T, 10) to Ψ but set Lbkts[s] = Lbkte[s] = 10 to record suf(T, 10) instead. Later on, before scanning bucket(s), suf(T, 10) is saved to Ψ. That is, at the row between '1' and '10', when the preceding suffix of suf(T, 10) is scanned and known as suf(T, 1), suf(T, 10) is saved to Ψ.
Next, bucket(i) is scanned, by scanning bucket_ℓ(i) first and then bucket_s(i). In this case, given the empty bucket_ℓ(i), only bucket_s(i) needs to be scanned. The smallest suffix in bucket_s(i) is given by Sbkts[i] = 7 as suf(T, 7); its preceding suffix is suf(T, 6). Because T[6] = s is L-type, suf(T, 6) is put in bucket(s). Given Lbkte[s] = 10, the previous suffix of suf(T, 6) in SA is determined as suf(T, 10), so we put suf(T, 6) in A[10] and underline it to indicate that it is the L-suffix newly inserted into Ψ in this step. Meanwhile, Lbkte[s] is updated to 6 to record that the current end suffix of bucket_ℓ(s) is suf(T, 6).
Similarly, the same method is applied to sort the other L-suffixes at rows '3' to '5'. Finally, we obtain the result Ψ in the last row.
Because the sequential access to SA and putting each L-type suffix in its correct position can be carried out in O(1) time, the whole process runs in O(n) time using only one single integer array A and O(1) workspace in total.
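The bucket-by-bucket procedure above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a transcription of Algorithm 2: positions are 1-based, A[0] is the list head, 0 doubles as the end-of-list marker, the text ends with a unique smallest sentinel, and the bucket pointers are kept in dictionaries. The names psi_from_psi_s, put and scan are illustrative only.

```python
def psi_from_psi_s(text, A):
    """Turn Psi_s (stored in the S-type slots of A) into Psi, in place.

    Assumptions of this sketch: 1-based positions, A[0] is the list head,
    0 marks the end of the list, and text ends with a unique sentinel
    that is smaller than every other character (for example '$').
    """
    T = "\0" + text                      # T[1..n]; T[0] is a dummy
    alphabet = sorted(set(text))
    # Step 1: every bucket pointer starts as EMPTY (None here).
    Lbkts = dict.fromkeys(alphabet)
    Lbkte = dict.fromkeys(alphabet)
    Sbkts = dict.fromkeys(alphabet)
    Sbkte = dict.fromkeys(alphabet)

    # Step 2: one pass over Psi_s records the first and last S-suffix
    # of every bucket.
    i = A[0]
    while i != 0:
        c = T[i]
        if Sbkts[c] is None:
            Sbkts[c] = i
        Sbkte[c] = i
        i = A[i]

    def put(j):
        """Append the L-suffix j to bucket_l(T[j]), sub-steps (1) to (3)."""
        c = T[j]
        if Lbkts[c] is None:
            Lbkts[c] = Lbkte[c] = j      # first L-suffix of this bucket
        else:
            A[Lbkte[c]] = j              # link the previous end to j
            Lbkte[c] = j

    def scan(start, end_of, c, is_l_part):
        """Walk one sub-bucket and insert the L-suffixes it induces."""
        i = start
        while True:
            j = i - 1
            if j >= 1 and (T[j] >= T[i] if is_l_part else T[j] > T[i]):
                put(j)
            if i == end_of[c]:           # re-read: bucket_l(c) may have grown
                return i
            i = A[i]

    # Step 3: visit the buckets in ascending order, L part before S part.
    tail = 0                             # last suffix linked so far
    for c in alphabet:
        if Lbkts[c] is not None:
            A[tail] = Lbkts[c]
            tail = scan(Lbkts[c], Lbkte, c, True)
        if Sbkts[c] is not None:
            A[tail] = Sbkts[c]
            tail = scan(Sbkts[c], Sbkte, c, False)
    A[tail] = 0
    return A


# Psi_s row of Table IV; None marks the slots that are still empty.
A = [11, None, None, 8, 0, None, None, 3, 4, None, None, 7]
print(psi_from_psi_s("mmiissiiss$", A))
# prints [11, 10, 1, 8, 2, 0, 9, 3, 4, 5, 6, 7]
```

Each suffix is visited once and each insertion is a constant number of pointer updates, which is consistent with the O(n) bound stated above; the dictionaries hold one entry per character and therefore stay constant in size for a constant alphabet.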
Now, we consider the situation that the S-suffixes are at least as many as the L-suffixes, that is, lnum ≤ snum. In this case, we need to compute Φ from Φ_ℓ. In Section 4.3, we presented the procedure to compute Φ_ℓ, in which all of the L-suffixes are stored in A, Φ_ℓ is an array-based singly linked list, and Φ_ℓ[0] is the lexicographically largest suffix in the list. We can scan all of the L-suffixes and S-suffixes in A in lexicographically descending order starting from Φ_ℓ[0]. Based on Lemma 1, we can sort the S-suffixes and store them in A. Thus, Φ can be obtained. This problem has been well addressed in [15, 16], so we will not discuss it here. Note that in [15, 16], BGoneT computes Φ_ℓ based on the LMS suffixes, and all of the L-suffixes are linked in increasing order. So, BGoneT needs to rewrite the work array A to reverse the direction of the links before sorting the S-suffixes.
4.5. In-place computing PSV/NSV from Ψ/Φ
Goto and Bannai [15, 16] showed that NSV can be computed in-place from Φ. Similarly, given Ψ, the PSV can also be computed in-place. In this section, we will show how to compute PSV in-place.
Because Ψ[i] is the immediate lexicographical successor of suf(T, i), we can think of Ψ as an array-based singly linked list, which links all elements of SA from left to right. Then, we can simulate the process for sorting the L-suffixes in SACA-K by accessing the Ψ array starting from the lexicographically smallest suffix, Ψ[0]. Because the suffix stored in Ψ[i] is no longer required to be kept after it has been processed, Ψ[i] can be rewritten to PSV[i]. Algorithm 3 shows the pseudocode for computing PSV from Ψ.
The following corollary can be obtained from Lemma 4.1 in [21].
Corollary 1
Given Ψ of an input string T, the PSV of T can be computed from Ψ in O(n) time and in-place using O(1) workspace.
With the given PSV, we can compute NSV and the LZ77 based on the following lemma.
Lemma 4 ([11])
Given the PSV of a string T of length n, NSV[i] of T can be sequentially obtained for all i = 1, ..., n in O(n) time using O(1) workspace excluding the PSV and T.
Thus, if the L-suffixes outnumber the S-suffixes, we can obtain the LZ77 of a given string T with a linear in-place algorithm, which computes Ψ_s first, then PSV from Ψ_s (via Ψ), and finally rewrites PSV to Ψ, using only one single integer array plus O(1) extra workspace; otherwise, the algorithm computes Φ_ℓ first, then NSV from Φ_ℓ (via Φ), and finally rewrites NSV to Φ, using only one single integer array plus O(1) extra workspace.
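Lemma 4 and the final rewriting of PSV into Ψ can be sketched in Python as below. The sketch relies on one observation: when position i is reached, the positions 1, ..., i − 1 form a lexicographically ordered linked list, PSV[i] is the predecessor of i in that list, and NSV[i] is the current successor of PSV[i]. The function name lz77_from_psv, the output format (a pair (source, length) for a copied phrase and (character, 0) for a literal), and the use of 0 for "no value" are assumptions made for illustration.

```python
def lz77_from_psv(text, psv):
    """Return (factors, A) where A ends up holding Psi.

    psv[1..n] holds the PSV values of the 1-based text (0 = none);
    psv[0] is ignored. Only a copy of psv is modified.
    """
    T = "\0" + text
    n = len(text)
    A = list(psv)
    A[0] = 0                             # empty list: head points to the end

    def lcp(i, j):
        """Length of the longest common prefix of suf(T, i) and suf(T, j)."""
        if j == 0:
            return 0
        length = 0
        while i + length <= n and T[i + length] == T[j + length]:
            length += 1
        return length

    factors = []
    next_start = 1
    for i in range(1, n + 1):
        p = A[i]                         # PSV[i], not yet overwritten
        nx = A[p]                        # NSV[i]: successor of PSV[i] so far
        if i == next_start:              # a new phrase starts here
            lp, ln = lcp(i, p), lcp(i, nx)
            if max(lp, ln) == 0:
                factors.append((T[i], 0))
                next_start = i + 1
            else:
                factors.append((p, lp) if lp >= ln else (nx, ln))
                next_start = i + max(lp, ln)
        A[p] = i                         # insert i between PSV[i] and NSV[i]
        A[i] = nx
    return factors, A


psv = [0, 0, 0, 0, 3, 1, 1, 0, 3, 6, 1, 0]       # PSV of mmiissiiss$
factors, psi = lz77_from_psv("mmiissiiss$", psv)
print(factors)
# prints [('m', 0), (1, 1), ('i', 0), (3, 1), ('s', 0), (5, 1), (3, 4), ('$', 0)]
```

After the loop the array again holds Ψ, which matches the statement that PSV is finally rewritten to Ψ.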
4.6. LZoneSA
In view that the SA of T is already available in some applications, we provide another variant of LZoneT, called LZoneSA, to compute the LZ77 from the SA efficiently. LZoneSA differs from LZoneT only in the way SA_ℓ or SA_s is computed. In Algorithm 1, that is, LZoneT, SA_ℓ or SA_s is computed from T. However, given that the SA is already known in the case of LZoneSA, SA_ℓ or SA_s can be computed efficiently as follows. For each character c, we count the numbers of the L-suffixes and the S-suffixes by scanning T from right to left, and then determine the range of bucket_s(c) or bucket_ℓ(c) in SA. In this way, we can quickly obtain SA_ℓ or SA_s from the given SA.
5. PERFORMANCE EVALUATION
We compare the performance of nine programs in this evaluation experiment: KKP1, KKP2, KKP3, BGtwo, BGoneT, BGoneSA, LZoneT, LZoneSA and gzip (version 1.3.12), where gzip is commonly shipped with Linux as a file compression utility. Both BGoneT and LZoneT compute the LZ77 directly from T in O(n) time using only one single integer array. BGtwo uses two integer arrays. The SA needed in other programs is computed by divsufsort, which was downloaded from https://code.google.com/p/libdivsufsort/.
The experiment was performed on a computer with a 2.20 GHz Intel(R) Xeon(R) CPU E5-2407 processor, 16 GiB RAM and Linux (CentOS 6.4 Final 64-bit). All programs were compiled by g++ with options '-fomit-frame-pointer -W -Wall -Winline -DNDBUG -O3'. The running times were measured in seconds, starting after the input string T had been read into memory, and the average of five runs was reported.
The datasets used in this experiment are available at http://pizzachili.dcc.uchile.cl/texts.html. Table V shows the details of the datasets. Table VI shows the runtime of the algorithms. The runtime of KKP1 includes the writing and reading time of the SA to and from the disk. The total time (in seconds) for each algorithm is the sum of all the times for running the algorithm on all the input data. The mean time (in seconds per MiB) for each algorithm is the total time divided by the total number of characters of all the input data (2127.66 MiB, see Table V).
The results show that for computing the LZ77 from an input string, KKP3 and gzip are on average 0.39/0.17 − 1 = 1.29 (129%) and 0.39/0.15 − 1 = 1.60 (160%) faster than LZoneT, respectively. LZoneT and LZoneSA are on average 0.49/0.39 − 1 = 0.26 (26%) and 0.47/0.30 − 1 = 0.57 (57%) faster than BGoneT and BGoneSA, respectively.
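The arithmetic behind these percentages can be reproduced with a few lines of Python. The mean times are copied from the last row of Table VI; the helper name relative_gain is only illustrative.

```python
# Mean running times in seconds per MiB, copied from Table VI.
mean_time = {"KKP3": 0.17, "gzip": 0.15, "BGoneT": 0.49,
             "BGoneSA": 0.47, "LZoneT": 0.39, "LZoneSA": 0.30}


def relative_gain(slower, faster):
    """Ratio of the two mean times minus one, rounded to two decimals."""
    return round(mean_time[slower] / mean_time[faster] - 1, 2)


print(relative_gain("LZoneT", "KKP3"))      # prints 1.29
print(relative_gain("LZoneT", "gzip"))      # prints 1.6
print(relative_gain("BGoneT", "LZoneT"))    # prints 0.26
print(relative_gain("BGoneSA", "LZoneSA"))  # prints 0.57
```

A value of 0.26 therefore means that the slower program needs 26% more time per MiB than the faster one.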
To further compare the performance of BGoneT and LZoneT in different phases, we divide the whole process into three phases as follows. For BGoneT: (i) Compute SA_lms from T; (ii) Compute Φ_ℓ from SA_lms; (iii) Compute LZ77 from Φ_ℓ. For LZoneT: (i) Compute SA_ℓ/SA_s from T; (ii) Compute Φ_ℓ/Ψ_s from SA_ℓ/SA_s; (iii) Compute LZ77 from Φ_ℓ/Ψ_s.
Table V. Datasets used in the experiments.

Name              Size (MiB)   |Σ|   Description
Proteins.200          200.00    25   Swissprot database
English.200           200.00   225   Gutenberg project
Dna.200               200.00    16   Human Genome
Sources.200           200.00   230   Linux and GCC code
Coreutils             195.77   236   GNU coreutils source
Cere                  439.92     5   Baking yeast Genome
Kernel                246.01   160   Linux kernel source
Einstein.en.txt       445.96   139   Wikipedia articles
Total                2127.66     -   Total number of characters
Table VI. Time for computing Lempel–Ziv factorization from T for linear-time algorithms.

Algorithm           KKP1    KKP2    KKP3   BGtwo   BGoneT  BGoneSA  LZoneT  LZoneSA    gzip
Proteins.200       45.41   45.25   42.75   64.65   114.72   108.10   96.14    74.97   13.44
English.200        39.98   39.89   37.39   60.23   112.18   104.36   91.30    67.69   22.11
Dna.200            43.06   42.91   39.51   63.55   112.16   109.82   91.91    74.09   62.52
Sources.200        29.68   29.48   27.65   47.70    84.78    78.09   67.99    52.09   11.52
Coreutils          30.32   30.22   28.85   48.53    86.10    81.48   61.80    52.01   11.46
Cere               82.72   82.52   80.26  128.58   232.00   224.17  185.19   136.20  135.57
Kernel             38.59   38.46   36.93   64.20   123.02   115.37   92.97    68.24   15.80
Einstein.en.txt    67.72   67.52   66.02  108.50   177.55   169.54  143.12   117.88   37.21
Total (s)         377.48  376.25  359.36  585.94  1042.51   990.93  830.42   643.14  309.55
Mean (s/MiB)        0.18    0.18    0.17    0.28     0.49     0.47    0.39     0.30    0.15
Table VII shows the time consumption in each phase for computing the LZ77 from T. BGoneT is faster in the first phase, but it takes about seven times longer than LZoneT in the second phase (0.15 versus 0.02 s/MiB on average). In the third phase, the time gaps between BGoneT and LZoneT are small. In Tables VII and VIII, the speedup for each set of input data is the ratio of the speed of LZone to that of BGone, that is, the ratio of the time consumed by BGone to that consumed by LZone. Table VIII gives the time consumption for computing LZ77 from SA. The result shows that once the SA has been obtained, the speed of LZoneSA is on average twice that of BGoneSA.
The key reason for LZone to run faster than BGone is its new way to compute Φ_ℓ or Ψ_s from SA_ℓ or SA_s in the second phase, respectively. For example, having obtained SA_lms, BGone performs induced sorting in a non-sequential way to compute Φ_ℓ, which needs to access Lbkts[c] and Lbkte[c] frequently to dynamically mark the head or end of bucket_ℓ(c) for putting all the L-suffixes into their proper positions in A given by Lbkts[c] or Lbkte[c]. This process requires frequent random accesses to A. Different from BGone, LZone computes SA_ℓ or SA_s and then sequentially scans A twice for computing Φ_ℓ or Ψ_s, respectively: once to put all the L-suffixes or S-suffixes into the S- or L-type positions in A, respectively, and once more to sequentially link all these L-suffixes or S-suffixes to produce Φ_ℓ or Ψ_s, respectively. By doing so, LZone avoids random accesses to A and hence runs faster.
In addition to the linear-time LZ77 factorization algorithms described previously, there are also some non-linear-time LZ77 factorization algorithms that are practically fast and space economical. For example, the most space economical one, called LZscan [25], runs in O(dn) time and O((n log n)/d) bits of working space, where the parameter d is used for the space–time trade-off. LZscan divides T into d = ⌈n/b⌉ fixed size blocks of length b, then parses the blocks one by one. LZscan runs slower
Table VII. Time in each step for computing Lempel–Ziv factorization from T by BGoneT and LZoneT.

BGoneT steps: (1) T → SA_lms; (2) SA_lms → Φ_ℓ; (3) Φ_ℓ → LZ77.
LZoneT steps: (1) T → SA_ℓ/SA_s; (2) SA_ℓ/SA_s → Φ_ℓ/Ψ_s; (3) Φ_ℓ/Ψ_s → LZ77.

                  BGoneT                             LZoneT
Algorithm            (1)     (2)     (3)    Total      (1)     (2)     (3)   Total  Speedup
Proteins.200       45.93   32.68   36.09   114.70    59.88    4.91   31.81   96.60     1.19
English.200        41.99   34.13   36.17   112.29    56.87    4.83   29.65   91.35     1.23
Dna.200            39.14   34.89   38.02   112.05    53.59    4.87   33.48   91.94     1.22
Sources.200        31.52   25.35   28.09    84.96    40.13    4.54   23.48   68.15     1.25
Coreutils          28.81   26.38   30.74    85.93    36.41    4.35   20.97   61.73     1.39
Cere               84.82   74.26   73.08   232.16   124.91   10.76   50.25  185.92     1.25
Kernel             42.90   39.44   40.73   123.07    59.14    5.80   27.97   92.91     1.32
Einstein.en.txt    70.06   54.98   52.42   177.46    85.96   10.92   45.97  142.85     1.24
Total (s)         385.17  322.11  335.34  1042.62   516.89   50.98  263.58  831.45     1.25
Mean (s/MiB)        0.18    0.15    0.16     0.49     0.24    0.02    0.12    0.39     1.26
Table VIII. Time for computing Lempel–Ziv factorization from SA by BGoneSA and LZoneSA.

Algorithm         BGoneSA  LZoneSA  Speedup
Proteins.200        71.99    37.26     1.93
English.200         73.09    35.60     2.05
Dna.200             76.49    39.59     1.93
Sources.200         55.90    28.86     1.94
Coreutils           56.80    26.16     2.17
Cere               154.39    63.69     2.42
Kernel              83.38    34.86     2.39
Einstein.en.txt    113.78    59.19     1.92
Total (s)          685.83   325.21     2.11
Mean (s/MiB)         0.32     0.15     2.11
Table IX. Time for computing Lempel–Ziv factorization from T for both linear-time and non-linear-time algorithms.

Algorithm          LZscan   ISA6s  LZoneT  LZoneSA
Proteins.200       229.41   62.04   96.14    74.97
English.200        188.30   52.34   91.30    67.69
Dna.200            206.00   54.40   91.91    74.09
Sources.200        191.08   38.03   67.99    52.09
Coreutils           71.53   31.24   61.80    52.01
Cere               206.79   85.49  185.19   136.20
Kernel              73.86   40.04   92.97    68.24
Einstein.en.txt     93.65   68.75  143.12   117.88
Total (s)         1261.52  423.33  830.42   643.14
Mean (s/MiB)         0.59    0.20    0.39     0.30
when d is increased. ISA6s [22], which is another non-linear-time LZ77 factorization algorithm and runs in O(n log σ) time and (1 + ε)n log n + n + O(σ log n) bits of space, seems to be faster than LZone, but it needs much more space than LZone. More details for these two algorithms are available at https://www.cs.helsinki.fi/group/pads/lz77.html. Table IX shows the running times of these two non-linear-time algorithms compared with that of LZone. In this experiment, the parameter b for LZscan is chosen such that LZscan uses approximately the same space as that of LZone. The values of b for LZscan on each dataset in this table, arranged from top to bottom, are {29, 29, 29, 29, 29, 65, 36, 66} in units of MiB.
For the space succinct algorithms LZscan, LZoneT, and LZoneSA, we see that both LZoneT and LZoneSA run faster than LZscan except for the last two datasets. In order to have LZscan run at a speed almost equal to that of LZoneT, the values of b are changed to {101, 95, 100, 100, 64, 95, 22, 5}. That is, b increases for most datasets except the last two. The reason is that, for LZscan on a given n, the time is O(dn) = O(n²/b) and the working space is O((n log n)/d) = O(b log n) bits, which means that a bigger b gives a faster speed but uses more space. In Table IX, LZscan runs faster than LZoneT only for the last two datasets. As a result, b is reduced only for the last two datasets, from {36, 66} to {22, 5}, to slow down LZscan, and increased for all the other datasets to accelerate LZscan.
6. CONCLUSION
We presented a linear-time algorithm called LZone to compute the LZ77 for an input string of a constant alphabet, which requires only n words of workspace. LZone has two variations, LZoneT and LZoneSA. The former directly computes the LZ77 from the input string T, while the latter computes the LZ77 from the suffix array of T. Given T and/or its suffix array, we can obtain the integer arrays Φ_ℓ or Ψ_s. Once Φ_ℓ or Ψ_s is available, Φ or Ψ can be computed by simulating the sorting process of SACA-K. Then, NSV or PSV can be further computed from Φ or Ψ. Finally, the LZ77 can be obtained from either NSV or PSV. Note that BGoneT and LZoneT compute the LZ77 directly from string T and do not need to compute the whole SA of string T in advance. Thus, more space can be saved when one needs to compute the LZ77 of a large-scale dataset.
The difference between BGone and LZone is that BGone first computes SA_lms, while LZone first computes SA_ℓ/SA_s. BGone uses the fact that at most one of two neighboring suffixes is LMS, while LZone uses the fact that the numbers of L-suffixes and S-suffixes cannot both exceed ⌊n/2⌋. In the whole computation, a number of techniques are employed to rewrite the various auxiliary integer arrays from one to another in-place and in linear time. Our experimental study shows that in practice LZone runs faster than BGone (by about 26% for LZoneT and 57% for LZoneSA on average) with the same space complexity. While LZone gives a faster way for computing the LZ77 in a small space, we expect that our program used in the experimental study can be better engineered by experts for further acceleration.
ACKNOWLEDGEMENTS
DEGP of China, The Research Grant Council of Hong Kong SAR;
contract/grant number DEGP2014KTSCX007, GRF 810012.
REFERENCES
1. Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 1977; 23(3):337–343. DOI: 10.1109/TIT.1977.1055714.
2. Bell T, Kulp D. Longest-match string searching for Ziv–Lempel compression. Software: Practice and Experience 1993; 23(7):757–771. DOI: 10.1002/spe.4380230705.
3. Thies W, Hall S. Manipulating lossless video in the compressed domain. Proceedings of the 17th ACM International Conference on Multimedia, MM ’09, ACM, New York, NY, USA, 2009; 331–340.
4. Fraser CW. An instruction for direct interpretation of LZ77-compressed programs. Software: Practice and Experience 2006; 36(4):397–411. DOI: 10.1002/spe.702.
5. Abel J. Incremental frequency count - a post BWT-stage for the Burrows-Wheeler compression algorithm. Software: Practice and Experience 2007; 37(3):247–265. DOI: 10.1002/spe.763.
6. Langiu A. On parsing optimality for dictionary-based text compression - the zip case. Journal of Discrete Algorithms 2013; 20(0):65–70. DOI: 10.1016/j.jda.2013.04.001.
7. Ferragina P, Manzini G. Indexing compressed text. Journal of the ACM 2005; 52(4):552–581. DOI: 10.1145/1082036.1082039.
8. Claude F, Fariña A, Martínez-Prieto MA, Navarro G. Indexes for highly repetitive document collections. Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM ’11, ACM, New York, NY, USA, 2011; 463–468.
9. Gagie T, Gawrychowski P, Kärkkäinen J, Nekrich Y, Puglisi SJ. A faster grammar-based self-index. In Language and automata theory and applications, vol. 7183, Dediu AH, Martín-Vide C (eds)., Lecture Notes in Computer Science. Springer-Verlag: Berlin, Germany, 2012; 240–251.
10. Gagie T, Gawrychowski P, Kärkkäinen J, Nekrich Y, Puglisi SJ. LZ77-based self-indexing with faster pattern matching. In Combinatorial Pattern Matching, vol. 8392, Pardo A, Viola A (eds)., Lecture Notes in Computer Science. Springer-Verlag: Berlin, Germany, 2014; 731–742.
11. Kärkkäinen J, Kempa D, Puglisi SJ. Linear time Lempel–Ziv factorization: Simple, fast, small. In Combinatorial Pattern Matching, vol. 7922, Fischer J, Sanders P (eds)., Lecture Notes in Computer Science. Springer-Verlag: Berlin, Germany, 2013; 189–200.
12. Chen G, Puglisi SJ, Smyth WF. Fast and practical algorithms for computing all the runs in a string. In Combinatorial Pattern Matching, vol. 4580, Ma B, Zhang K (eds)., Lecture Notes in Computer Science. Springer-Verlag: Berlin, Germany, 2007; 307–315.
13. Al-Hafeedh A, Crochemore M, Ilie L, Kopylova E, Smyth WF, Tischler G, Yusufu M. A comparison of index-based Lempel–Ziv LZ77 factorization algorithms. ACM Computing Surveys 2012; 45(1):5:1–5. DOI: 10.1145/2379776.2379781.
14. Kärkkäinen J, Manzini G, Puglisi SJ. Permuted longest-common-prefix array. In Combinatorial Pattern Matching, vol. 5577, Kucherov G, Ukkonen E (eds)., Lecture Notes in Computer Science. Springer-Verlag: Berlin, Germany, 2009; 181–192.
15. Goto K, Bannai H. Space Efficient Linear Time Lempel–Ziv Factorization on Constant Size Alphabets, 2013. Available from: http://arxiv.org/abs/1310.1448 [last accessed 31 March 2014].
16. Goto K, Bannai H. Space efficient linear time Lempel–Ziv factorization for small alphabets. Proceedings of the 2014 Data Compression Conference, DCC ’14, IEEE Computer Society, Washington, USA, 2014; 163–172.
17. Nong G. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Transactions on Information Systems (TOIS) 2013; 31(3):15:1–15. DOI: 10.1145/2493175.2493180.
18. Skibiński P, Grabowski S, Deorowicz S. Revisiting dictionary-based compression. Software: Practice and Experience 2005; 35(15):1455–1476. DOI: 10.1002/spe.678.
19. Skibiński P, Grabowski S, Swacha J. Effective asymmetric XML compression. Software: Practice and Experience 2008; 38(10):1027–1047. DOI: 10.1002/spe.859.
20. Crochemore M, Ilie L. Computing longest previous factor in linear time and applications. Information Processing Letters 2008; 106(2):75–80. DOI: 10.1016/j.ipl.2007.10.006.
21. Goto K, Bannai H. Simpler and faster Lempel Ziv factorization. Proceedings of the 2013 Data Compression Conference, DCC ’13, IEEE Computer Society, Washington, USA, 2013; 133–142.
22. Kempa D, Puglisi SJ. Lempel–Ziv factorization: Simple, fast, practical. In 2013 Proceedings of the Fifteenth Workshop on Algorithm Engineering and Experiments (ALENEX), Sanders P, Zeh N (eds)., ALENEX ’13. SIAM Pub: Philadelphia, USA, 2013; 103–112.
23. Ko P, Aluru S. Space efficient linear time construction of suffix arrays. In Combinatorial Pattern Matching, vol. 2676, Baeza-Yates R, Chávez E, Crochemore M (eds)., Lecture Notes in Computer Science. Springer-Verlag: Berlin, Germany, 2003; 200–210.
24. Nong G, Zhang S, Chan WH. Two efficient algorithms for linear time suffix array construction. IEEE Transactions on Computers 2011; 60(10):1471–1484. DOI: 10.1109/TC.2010.188.
25. Kärkkäinen J, Kempa D, Puglisi SJ. Lightweight Lempel–Ziv parsing. In Proceedings of 12th Symposium on Experimental Algorithms, Bonifaci V, Demetrescu C, Marchetti-Spaccamela A (eds)., SEA ’13. Springer-Verlag: Berlin, Germany, 2013; 139–150.