Parametric context adaptive Laplace distribution for multimedia compression

Jarek Duda, Jagiellonian University, Golebia 24, 31-007 Krakow, Poland. Email: [email protected]
Abstract—Data compression often subtracts a prediction and encodes the difference (residue), e.g. assuming Laplace distribution, for example for images, videos, audio, or numerical data. Its performance strongly depends on the proper choice of width (scale parameter) of this parametric distribution, and can be improved by optimizing it based on the local situation, i.e. the context. For example, the popular LOCO-I [1] (JPEG-LS) lossless image compressor uses a 3-dimensional context quantized into 365 discrete possibilities treated independently. This article discusses inexpensive approaches for exploiting their dependencies with autoregressive ARCH-like context-dependent models for the parameters of the parametric distribution of residues, also evolving in time in the adaptive case. For example, the tested 4- and 11-parameter models turned out to provide performance similar to the 365-parameter LOCO-I model for 48 tested images. Beside smaller headers, such reduction of the number of parameters can lead to better generalization. In contrast to context quantization approaches, parameterized models also allow to directly use higher dimensional contexts, for example using information from all 3 color channels, further pixels, some additional region classifiers, or from interleaving multi-scale scanning - for which a Haar upscale scan is proposed, combining the advantages of Haar wavelets with the possibility of scanning exploiting local contexts.
Keywords: data compression, LOCO-I, parametric distribution, context dependence, non-stationary time series, multi-scale scanning
I. INTRODUCTION
Many types of data statistically agree with specific parametric distributions, like Gaussian distribution through the law of large numbers, or Laplace distribution popular in data compression as it agrees with the statistics of errors from prediction (residues). Their parameters can often be inexpensively estimated, and storing them in a header is much less expensive than e.g. an entire probability distribution on some quantized set of represented values. Parametric distributions smoothen between discretized possibilities, generalizing statistical trends emerging in a given type of data.
However, for example due to randomness alone, statistics of real data usually deviate from such idealization. Directly storing counted frequencies can exploit this difference, gaining asymptotically the Kullback-Leibler divergence bits/value - at the cost of a larger header. Data compressors need to optimize this minimum description length [2] tradeoff between model size and the entropy it leads to.
Figure 1. Comparison of some discussed models for 48 grayscale 8-bit 512x512 images presented in Fig. 2. Top left: first we need to predict the pixel value based on the current context: the already decoded 4 neighboring pixels c = (A,B,C,D). This predicted µ(c) is used as the center of the Laplace distribution, which is estimated as median: minimizing l1 distance. Hence, the presented evaluation uses average |x − µ(c)| for 4 approaches: LOCO-I predictor (red), simple average (green), least squares parameters for combined images (orange), and least squares parameters chosen individually for each image (blue) - the last one gives the lowest residues, so it is used further. Top right: bits/pixel for encoding the residues (r = x − µ(c)) using a centered (µ = 0) Laplace distribution of width (scale parameter) b modeled in various ways. Red: LOCO-I model with 365 parameters corresponding to the quantized context (|C − A|, |B − C|, |D − B|). Green: single b chosen individually (MLE) for each image. Orange: the 4-parameter model discussed here, written at the bottom left; blue: the 11-parameter model discussed later. Bottom: differences of these values for the two models. The evaluation assumes accurate entropy coding (AC/ANS) and neglects headers - including them would worsen especially the LOCO-I evaluation if storing all 365 parameters.
In practice, instead of a single e.g. Laplace distribution to encode residues (errors of predictions) for the entire image, we would like to make its parameters dependent on the local situation - through context dependence as in Markov modelling, or adaptivity as for non-stationary time series.

The possibility to directly store all values fades away when increasing the dimension of the model - both due to size growing exponentially with dimension, and due to underrepresentation. Going to higher dimensions requires finding and exploiting some general behaviour, for example through parametrizations, as in the examples presented in Fig. 1.
LOCO-I [1] mixes both philosophies: it uses parametric probability distributions, whose scale parameter (width of
Laplace distribution) depends on a 3-dimensional context quantized into 365 possibilities treated independently - neglecting their dependencies. Such an approach is useful for low dimensional contexts; however, it becomes impractical when wanting to use a higher dimensional context, e.g.: using information from all 3 color channels, further pixels than the nearest neighbors, or some region classifiers to gradually transition between e.g. models for smooth regions like sky and complex textures like a treetop. Finally, contexts of much higher dimension appear in multiscale interlaced scanning as in the FLIF [3] compressor: progressively improving quality; rather only parametric models can directly work on its high dimensional contexts.
This article discusses such parametric-parametric models: choose parameters of e.g. Laplace distribution as a parametric function of the context, like through a linear combination, or generally e.g. neural networks. An example are ARMA-ARCH [4] models popular in economics: choosing the squared width of Gaussian distribution as a linear combination of recent squared residues, e.g. $\sigma_t^2 = \beta_0 + \beta_1 \epsilon_{t-1}^2$.
These parameters can be universal, e.g. default for various types of classified regions, or optimized individually by the compressor and stored in the header. For the latter purpose we will focus on least squares estimation due to its low cost. Presented test results are for such estimation; a costly additional optimization might slightly improve performance.
While we will mostly focus on such static models, assuming a constant joint distribution of (value, context), a mentioned alternative are adaptive models: assuming a non-stationary time series with evolving joint distribution. This requires additional cost to update the parameters of the model, for example performing a likelihood optimization step while processing each value. It has two advantages: it can learn the model from already decoded data even without a header, and it can flexibly adapt to local behavior e.g. of an image. The Appendix discusses second order approaches for such online optimization.
In the literature there are also considered much more costly models, like using neural networks for predicting the probability distribution of succeeding pixels ([5], [6]). In the discussed philosophy, instead of directly predicting the probability of each discrete value e.g. with softmax, we can use such neural networks to directly predict context dependent parameters of some parametric distribution for the new pixel. Such simplification should allow to use much smaller neural networks, bringing it closer to practical application in data compression.
II. PARAMETRIC-PARAMETRIC DISTRIBUTIONS
We would like to model the conditional probability distribution Pr(x|c) of the new value x ∈ R, based on some local d-dimensional context c = (c1, . . . , cd) ∈ C ⊂ R^d, in practice bounded e.g. to a cube like C = [0, 1]^d here. In the LOCO-I image compressor this context are the 4 neighboring already decoded pixels (c = (A,B,C,D) as in Fig. 1). Both value
Figure 2. Dataset of 48 grayscale 8-bit 512x512 images used in tests. Source: http://decsai.ugr.es/cvg/CG/base.htm
and context are rather discrete through some quantization, but it is useful to model them as real values - especially when wanting to exploit the continuity of their behavior.
Modelling general continuous conditional distributions is a difficult task - it requires techniques like quantile regression [7] or hierarchical correlation reconstruction [8], [9]. However, the situation becomes much simpler if focusing on simple parametric distributions for the predicted distribution. Another standard simplification is separately modelling the center of the distribution with a predictor µ(c), and the remaining parameter(s) θ(c) of the centered distribution for the residue r = x − µ(c), usually a single scale parameter defining width:

$$r = x - \mu(c) \quad \text{residue from } \rho_{\theta(c)} \text{ density} \quad (1)$$
We will mainly focus on the Laplace distribution, standard for such applications, and model its width parameter b:

$$\rho_{\mu b}(x) = \frac{1}{2b} \exp\left(\frac{-|x-\mu|}{b}\right) \qquad \rho_b(r) = \frac{1}{2b} \exp\left(\frac{-|r|}{b}\right) \quad (2)$$
whose MLE parameters for a sample (x1, . . . , xn) are:

$$\mu = \text{median of } \{x_i\} \qquad b = \frac{1}{n}\sum_{i=1}^{n} |x_i - \mu| \quad (3)$$
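For instance, a minimal sketch of these estimators (assuming numpy; names are illustrative):

```python
import numpy as np

def laplace_mle(x):
    """MLE parameters of Laplace distribution as in (3):
    mu = sample median, b = mean absolute deviation from mu."""
    mu = np.median(x)
    b = np.mean(np.abs(x - mu))
    return mu, b

# example: recover parameters of a synthetic Laplace sample
rng = np.random.default_rng(0)
sample = rng.laplace(loc=1.0, scale=0.5, size=10000)
mu, b = laplace_mle(sample)  # approximately (1.0, 0.5)
```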
LOCO-I has a fixed specialized predictor. It then chooses the width parameter θ(c) ≡ b(c) as locally constant inside 365 regions for the quantized |C − A|, |B − C|, |D − B| context, each quantized into 9 ranges of nearly equal population. This way we can perform estimation independently for each region, and finally e.g. store the 365 parameters in the header.
Quantization of context neglects dependencies between these regions and can be practical rather only for low dimensional contexts - both due to the number of possibilities growing exponentially with dimension, and due to underrepresentation of many such contexts. To resolve this, we will focus here on parameterized models for these parameters:

$$\mu(c) \equiv \mu_\alpha(c) \text{ for } \alpha \in \mathbb{R}^{d_\alpha} \quad \text{predictor}$$
$$\theta(c) \equiv \theta_\beta(c) \text{ for } \beta \in \mathbb{R}^{d_\beta} \quad \text{e.g. scale parameter}$$
Choosing the µα(c) and θβ(c) families of functions optimized for a given type of problems is a difficult question. Like ARCH, and unlike LOCO-I, we will focus on using linear combinations of some chosen functions f, g:

$$\mu_\alpha(c) = \alpha_1 f_1(c) + \alpha_2 f_2(c) + \ldots + \alpha_{d_\alpha} f_{d_\alpha}(c) \quad (4)$$
$$\theta_\beta(c) = \beta_1 g_1(c) + \beta_2 g_2(c) + \ldots + \beta_{d_\beta} g_{d_\beta}(c) \quad (5)$$

The latter might need an additional e.g. max(θ, 0.001) if positive values are required and some of the β are negative. We can alternatively use more sophisticated nonlinear models like neural networks.
A. Context dependence
Choosing some µα(c) and θβ(c) family of functions, we can optimize α, β (or e.g. neural network parameters) for given values (x1, . . . , xn) and contexts (c1, . . . , cn), for example maximizing likelihood (MLE):

$$(\alpha, \beta) = \underset{\alpha,\beta}{\operatorname{argmax}} \sum_{i=1}^{n} \log\left(\rho_{\theta_\beta(c_i)}(x_i - \mu_\alpha(c_i))\right) \quad (6)$$
To simplify this optimization at the cost of suboptimality, we can split it into the predictor and the remaining parameters as in Fig. 1. This way we can first optimize the parameters of the predictor, e.g. using some distance d:

$$\alpha = \underset{\alpha}{\operatorname{argmin}} \sum_{i=1}^{n} d(x_i, \mu_\alpha(c_i)) \quad (7)$$

For example, using the least squares distance d(x, y) = (x − y)² we are looking for a predictor of the expected value - appropriate e.g. for Gaussian distribution (or polynomial coefficients in [9]). For Laplace distribution it is more appropriate to use d(x, y) = |x − y| for a predictor of the median. However, except for heavy-tail cases, both optimizations give nearly the same predictor, so it is safe to use the computationally less expensive least squares optimization.
Having the optimized predictor, we can calculate residues r_i = x_i − µα(c_i) and separately optimize β using them.

Especially for the scale parameter, the MLE estimator is often an average over some simple function of values, for example b = average |r| for Laplace distribution (θ ≡ b), σ² = average r² for Gaussian distribution (θ ≡ σ²), or generally average |r|^κ for exponential power distribution (θ ≡ b^κ). Average is an estimator of expected value, which allows for practical optimization of β using least squares (analogously e.g. for neural networks):

$$\beta = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n} \left(|r_i| - \theta_\beta(c_i)\right)^2 \quad \text{for Laplace: } \theta \equiv b \quad (8)$$
$$\beta = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n} \left((r_i)^2 - \theta_\beta(c_i)\right)^2 \quad \text{for Gaussian: } \theta \equiv \sigma^2$$
Such parameters can be optimized for a dataset, for example for different regions using some segmentation, and then used as defaults. Alternatively, the compressor can optimize them individually, e.g. for a given image, and store the parameters in the header.
B. Adaptivity
Instead of storing model parameters in the header, an alternative approach is starting from some default parameters and adapting them based on the processed data, also for better agreement with varying local statistics e.g. of an image. Such adaptation brings additional cost; dependence on the local situation can alternatively be realized by using some region classifier/segmentation with separate models for each class, or by using the outcome of such local classifier as additional context - choosing the best tradeoffs is a difficult question.

For adaptation we can treat the upper index as time and use time dependent parameters, starting from some e.g. default initial choice for t = 0. For example, without context dependence we could just replace the average with an exponential moving average for Laplace distribution and some learning rates η, ν ∈ (0, 1):

$$\mu_{t+1} = \nu\,\mu_t + (1-\nu)\,x_t \qquad b_{t+1} = \eta\, b_t + (1-\eta)\,|x_t - \mu_t|$$
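As a sketch (pure Python, illustrative learning rates):

```python
def adapt_laplace(xs, mu0=0.0, b0=1.0, nu=0.95, eta=0.95):
    """Adaptive context-free Laplace parameters via exponential moving
    averages: mu tracks the center, b the width, as in the formulas above."""
    mu, b = mu0, b0
    params = []
    for x in xs:
        params.append((mu, b))                 # parameters used to encode x
        b = eta * b + (1 - eta) * abs(x - mu)  # width update with old mu
        mu = nu * mu + (1 - nu) * x            # center update
    return params
```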
Generally, we could use for example a gradient step while processing each value to optimize parameters toward local statistics, for combined (α, β) using (6), or in split form:

$$r_t = x_t - \mu_{\alpha_t}(c_t) \quad \text{residue from } \rho_{\theta_{\beta_t}} \text{ density}$$
$$\alpha_{t+1} = \alpha_t - \eta_\alpha \frac{\partial d(x_t, \mu_\alpha(c_t))}{\partial \alpha}(\alpha_t)$$
$$\beta_{t+1} = \beta_t + \eta_\beta \frac{\partial \log(\rho_{\theta_\beta(c_t)}(r_t))}{\partial \beta}(\beta_t) \quad (9)$$

where d is a distance as previously. For β the above gradient ascent optimizes likelihood; ηα, ηβ define the adaptation rate.

A first order method alone does not determine a proper choice of step size, suggesting to also use the second derivative and Newton's method (e.g. $\forall_i\ \theta_i^{t+1} = \theta_i^t - \partial_i f(\theta^t)/\partial_{ii} f(\theta^t)$) - the Appendix discusses such general approaches.
C. Exponential power distribution
Data compression usually focuses on Laplace distribution, but real data might have a bit different statistics, especially heavier tails. It might be worth considering more general families, especially the exponential power distribution [10]:

$$\rho_{\kappa\mu b}(x) = \frac{\kappa^{-1/\kappa}}{2\, b\, \Gamma(1 + 1/\kappa)}\, e^{-\frac{1}{\kappa}\left(\frac{|x-\mu|}{b}\right)^\kappa} \quad (10)$$

It covers both Laplace (κ = 1) and Gaussian (κ = 2, b ≡ σ) distributions. Estimating κ is costly, but we can fix it based on a large dataset and e.g. segment type. Then estimation of µ, b is analogous, also for context dependence like in (8):
$$\mu = \underset{\mu}{\operatorname{argmin}} \sum_{i=1}^{n} |x_i - \mu|^\kappa \qquad b = \left(\frac{1}{n}\sum_{i=1}^{n} |x_i - \mu|^\kappa\right)^{1/\kappa}$$
$$\beta = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n} \left(|r_i|^\kappa - \theta_\beta(c_i)\right)^2 \quad \text{for } \theta \equiv b^\kappa \quad (11)$$
Here is a simple example of its adaptive estimation for learning rates η, ν ∈ (0, 1):

$$\mu_{t+1} = \nu\,\mu_t + (1-\nu)\,x_t$$
$$\theta_{t+1} = \eta\,\theta_t + (1-\eta)\,|x_t - \mu_t|^\kappa \quad \text{for } \theta \equiv b^\kappa \quad (12)$$
In data compression we can have prepared entropy coding tables for such fixed κ and some optimized discretized set of scale parameters b.
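A sketch of the static and adaptive estimation for fixed κ (assuming numpy and scipy; the argmin for µ has no closed form for general κ, so a generic scalar minimizer is used):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def epd_fit(x, kappa):
    """Static estimation for exponential power distribution (10)-(11):
    mu minimizes sum |x_i - mu|^kappa, then b from the kappa-th moment."""
    mu = minimize_scalar(lambda m: np.sum(np.abs(x - m) ** kappa)).x
    b = np.mean(np.abs(x - mu) ** kappa) ** (1.0 / kappa)
    return mu, b

def epd_adapt_step(x_t, mu, theta, kappa, nu=0.95, eta=0.95):
    """One adaptive update (12); theta corresponds to b**kappa."""
    theta = eta * theta + (1 - eta) * abs(x_t - mu) ** kappa
    mu = nu * mu + (1 - nu) * x_t
    return mu, theta
```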
D. Adaptive least-squares linear regression*

*This subsection expands adaptivity to linear regression - for completeness and to connect some concepts; however, it might be too costly for data compression and is not used further (yet).
The above formula (12) for b can be seen as obtained from online adaptive ML estimation: instead of the standard "static" estimation of constant parameters based on the entire sample, we perform ML estimation separately for every moment in time - using only its past information, weakening the influence of old values e.g. with an exponential moving average. This way we optimize parameters separately for every time, instead of the standard: finding a single compromise for all of them.

Specifically, we get the (12) formula for b if maximizing the exponentially weighted log-likelihood $l^T = \sum_{t<T} \eta^{T-t} \ln(\rho_\theta(x_t))$ separately for every time T.
III. PRACTICAL LAPLACE EXAMPLE AND EXPERIMENTS
Let us now focus on the LOCO-I lossless image compression setting: the context are 4 already decoded neighboring pixels c = (A,B,C,D) at the (left, up, left-up, right-up) positions, correspondingly, as in the diagram in Fig. 1.
A. Predictor µ(c)
LOCO-I uses a fixed predictor (c = (A,B,C,D)):

$$\mu(c) = \begin{cases} \min(A,B) & \text{if } C \geq \max(A,B) \\ \max(A,B) & \text{if } C \leq \min(A,B) \\ A+B-C & \text{otherwise} \end{cases} \quad (19)$$

Simpler popular choices are e.g. (A+B)/2 or A+B−C. A standard way of designing such predictors is polynomial interpolation, e.g. in the Lorenzo predictor [11]: fitting some polynomial to the known values and calculating its value at the predicted position, getting a linear combination.
We can also directly optimize it for a dataset. For example, least squares optimization using the combined 48 images (Fig. 2) gives (rounded to 2 digits, weights sum to 1):

$$\mu(c) = 0.57A + 0.48B - 0.2C + 0.15D$$
Alternatively, the compressor can optimize these weights individually for each image and store them in the header - Fig. 1 contains a comparison of various approaches using the l1 distance, as we would like to estimate the median for Laplace distribution. Such individual least squares optimization always turns out superior there (blue points); for some images the LOCO-I predictor is much worse than the remaining ones.
The tested inexpensive least squares optimizer directly uses the dα = 4 functions f1(c) = A, f2(c) = B, f3(c) = C, f4(c) = D in the notation of (4). We build an n × dα matrix P from them, Pij = fj(ci), and the vector x = (x1, . . . , xn). Then the optimal parameters are obtained using the pseudo-inverse (as derived in (15) for equal weights w):

$$\alpha = \underset{\alpha}{\operatorname{argmin}} \|P\alpha - x\|_2^2 = (P^\dagger P)^{-1} P^\dagger x \quad (20)$$

For further tests there were used residues from individual least squares optimization for each image: r = x − Pα.
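A sketch of this optimization (assuming numpy; img is a 2D array with image borders skipped for simplicity):

```python
import numpy as np

def fit_predictor(img):
    """Least squares predictor weights (20) for the 4-neighbor context
    c = (A, B, C, D) = (left, up, up-left, up-right)."""
    A = img[1:-1, :-2].ravel()   # left neighbor
    B = img[:-2, 1:-1].ravel()   # upper neighbor
    C = img[:-2, :-2].ravel()    # upper-left neighbor
    D = img[:-2, 2:].ravel()     # upper-right neighbor
    P = np.column_stack([A, B, C, D])
    x = img[1:-1, 1:-1].ravel()  # pixels to be predicted
    alpha, *_ = np.linalg.lstsq(P, x, rcond=None)
    return alpha, x - P @ alpha  # weights and residues r = x - P alpha
```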
B. Context dependent scale parameter b(c)
Having the residues, LOCO-I would divide |C−A|, |B−C|, |D−B| into 9 ranges each, of nearly equal population. Including symmetry, it leads to a division into (9³ + 1)/2 = 365 contexts. For each of them we independently estimate the scale parameter b of Laplace distribution.
Here we would like to model b as a linear combination (5) of some functions (gj(c))j=1..dβ of the context. The choice of these functions is difficult and essentially affects compression ratios. They should contain "1" for the intercept term. Then, in analogy to LOCO-I, the considered 4-
Figure 3. Top: probability density of b parameters for all images, for LOCO-I and the discussed 4-parameter model, assuming the models are estimated and stored individually for each image. The three most characteristic images are marked with their numbers. Bottom left: such densities if combining all images into one - while the huge number of LOCO-I parameters can usually learn individual images better than the 4-parameter model, it has worse generalization - it is inferior when combining different types of patterns. Bottom right: penalty of using power-of-2 Golomb coding for various b parameters. We can get ≈ 2% improvement when switching to arithmetic coding or asymmetric numeral systems; however, especially for LOCO-I it would require larger headers due to the needed better precision of b.
parameter model uses the following linear combination (for convenience enumerated from 0):

$$b(c) = \beta_0 + \beta_1 |C-A|^{0.8} + \beta_2 |B-C|^{0.8} + \beta_3 |D-B|^{0.8} \quad (21)$$
There is a freedom in choosing the above power, and empirically ≈ 0.8 has turned out to provide the best likelihood/compression ratio - it corresponds well to linear behavior of b. With this choice, all the coefficients β turn out positive in experiments - we have some initial width β0, growing with increased gradients in the neighboring pixels. Hence there is no possibility of getting a negative b this way, which would make no sense.
Having chosen such e.g. dβ = 4 functions, we build an n × dβ matrix from them, Sij = gj(ci), and the residue vector |r| = (|r1|, . . . , |rn|). Then we can use least squares optimization:

$$\beta = \underset{\beta}{\operatorname{argmin}} \|S\beta - |r|\|_2^2 = (S^\dagger S)^{-1} S^\dagger |r| \quad (22)$$
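Continuing the sketch above, the 4-parameter model (21)-(22) could be fitted as follows (A, B, C, D neighbor arrays and residues r as in the previous snippet):

```python
import numpy as np

def fit_width_model(A, B, C, D, r, power=0.8):
    """Fit the width model (21) by least squares (22):
    b(c) = beta0 + beta1|C-A|^p + beta2|B-C|^p + beta3|D-B|^p."""
    S = np.column_stack([
        np.ones_like(A),          # intercept beta0
        np.abs(C - A) ** power,
        np.abs(B - C) ** power,
        np.abs(D - B) ** power,
    ])
    beta, *_ = np.linalg.lstsq(S, np.abs(r), rcond=None)
    b = np.maximum(S @ beta, 0.001)  # clamp against negative predictions
    return beta, b
```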
Figure 3 contains a comparison of the density of predicted scale parameters b for individual images (top) for the LOCO-I approach and the above 4-parameter model - the latter is smoother, as we could expect, but generally they have similar behavior. The bottom left of this figure contains a comparison when combining all images, and compression ratios showing better generalization of these low-parameter models.
The second considered, dβ = 11 parameter model extends the above basis by the following arbitrarily chosen 7 functions: symmetric ones describing the intensity of neighboring pixels, and ones evaluating the second derivative:

$$(A-0.5)^4,\ (B-0.5)^4,\ (C-0.5)^4,\ (D-0.5)^4$$
$$|C-2B+D|^{0.1},\ |A-2C+B|^{0.1} \quad (23)$$
where again the powers were chosen empirically to get the best likelihood/compression ratio. In contrast to the 4-parameter model, this time we also get negative β coefficients, leading to negative predicted b. To prevent that, max(b, 0.001) was finally used as the width of Laplace distribution.
The used functions were chosen arbitrarily by manual optimization; some wider systematic search should improve performance. For example, in practical implementations the above power functions would rather be put into tables, which allows to use much more complex functions, like ones given by stored values on some quantized set of arguments. It would allow to carefully optimize such tabled functions based on a large set of images.
The above was for Laplace distribution. For the more general exponential power distribution, |r|^κ should be used in (22) instead of |r|, and the prediction Sβ as in (21) gives b^κ.
C. Entropy coding, penalty of Golomb coding
Laplace distribution is continuous; to encode values from it we need to quantize it to an approximately geometric distribution, whose values are transformed into bits using some entropy coding.
LOCO-I uses power-of-2 Golomb coding: instead of a real b coefficient, it optimizes the M = 2^m parameter, then x is stored as ⌊x/M⌋ using unary coding, and mod(x, M) is stored directly as bits. This way it requires 2⌊x/M⌋ + 1 + m bits to store an unsigned x. Signed values are stored as their position in the 0, 1, −1, 2, −2, . . . order.
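A minimal sketch of such coding; note it uses the common unary convention of q ones plus a terminating zero, so an unsigned x costs ⌊x/M⌋ + 1 + m bits here:

```python
def zigzag(v):
    """Map signed residue to unsigned: 0, 1, -1, 2, -2, ... -> 0, 1, 2, 3, 4, ..."""
    return 2 * v - 1 if v > 0 else -2 * v

def rice_encode(v, m):
    """Power-of-2 Golomb (Rice) code of signed v with parameter M = 2^m."""
    x = zigzag(v)
    q, rem = x >> m, x & ((1 << m) - 1)
    bits = '1' * q + '0'                                # unary quotient
    return bits + (format(rem, f'0{m}b') if m else '')  # m remainder bits

# rice_encode(-3, 2) == '1010': zigzag(-3) = 6, quotient 1, remainder '10'
```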
Ideally, a symbol of probability p carries log2(1/p) bits of information, leading to asymptotically Shannon entropy bits/symbol. Optimally parameterized power-of-two Golomb coding is worse by a few percent for the b values used here, as shown in Fig. 3. One reason is this sparse M = 2^m quantization of parameters. More important, especially for small b, is that most of the probability goes to the 0 quantized value, which can correspond to less than 1 bit of informational content. In contrast, prefix codes like Golomb need to use at least 1 bit per symbol.
Replacing power-of-2 Golomb coding with an accurate entropy coder like arithmetic coding (AC) or asymmetric numeral systems (ANS), we can improve compression ratio by ≈ 2%. In this case we also need some quantization of the b parameter - we can have prepared entropy coding tables for some discretized space of possible parameters.
D. Multi-scale interleaving
In standard line-by-line scanning we have context only from half of the plane, only guessing what will happen on the undecoded side. It can be improved with multi-scale interleaving, showing gains e.g. in the FLIF [3] compressor, where we can use lower resolution context from all directions thanks to progressive decoding in multiple scans, as visualized in Fig. 4.
However, we can see that context information becomes much more complex here: high dimensional, varying with the scan number. Even reducing it by some arbitrary averaging, it is still rather too large for context quantization approaches like in LOCO-I. The parametric approaches discussed here have no problem with direct use of such high dimensional contexts, modelling parameters as e.g. a linear combination of a chosen family of functions, with parameters chosen e.g. by inexpensive least squares optimization and stored in the header. Alternatively, more complex models can be used instead, like neural networks.
The figure also proposes a combination with Haar wavelets for hopefully improved performance - splitting decoding into k cycles, each improving resolution twice, and being composed of a few scans, e.g. 3 for grayscale or 9 for 3 colors - each providing a single degree of freedom per block for the upscaling. Such decomposition into e.g. 9 scans clearly leaves an opportunity for optimization, starting with the choice of color transformation.
Assuming some scale invariance of images, similar models can be used for different cycles here; for example, we can treat the cycle number (defining scale) as an additional parameter.
Figure 4. Top: conventional multi-scale interleaved scanning [5] (e.g. the FLIF compressor [3]): scanning over succeeding sub-lattices for progressive decoding, and most importantly to provide better local context for later decoded pixels. Bottom: proposed Haar upscale scanning, which combines the advantages of Haar wavelets [12] with exploitation of local context dependence. First (scan 0) we decode the low resolution image: averages over 2^k × 2^k size blocks, using decoded neighboring block averages as the context. Then in each cycle (scans 1, 2, 3) we decode the 3 missing values (for grayscale, 9 for RGB) to double the resolution: e.g. horizontal differences in scan 1, then vertical differences in two positions in scans 2 and 3. After k such cycles we reach 1 × 1 blocks - the completely decoded image. The context of already decoded local information is high dimensional, of a different type for each scan and level. While this is a problem for LOCO-I-like context quantization, parametric models can easily handle it, for example using a µs(c) = Σi α_i^s c_i predictor, where s denotes the type of scan - its parameters α can be inexpensively e.g. MSE optimized and stored in the header. Some modification options are e.g. splitting values into higher and lower bits for separate scans [6], or using fractal-like (tame twindragon) blocks by modifying translation vectors for a hexagonal block lattice [13].
IV. CONCLUSION AND FURTHER WORK
Parametric models allow to successfully exploit trends in behavior, also for context dependence and evolution of parametric distributions. Thanks to generalization, a model with a few parameters can provide better performance than treating all possibilities as independent - neglecting dependencies between them. When wanting to exploit higher dimensional contexts, e.g. for 3 colors, further pixels, region classifiers or multi-scale scanning, parametric models become a necessity, as the number of discretized possibilities would grow exponentially with dimension.
Very basic possibilities were presented and tested here, leaving many improvement opportunities, starting with the choice of contexts and functions, or using other parametric distributions like the exponential power distribution. The used least squares optimization is inexpensive enough to be used by the compressor to individually optimize parameters for each image. For example, choosing some general default parameters, we can use better optimizers, like l1 for the Laplace median, or generally MLE. These parameters can alternatively be optimized online, e.g. with the discussed adaptive linear regression; however, it might be too costly for data compression.
Lossy image compressors are in a different situation: coding e.g. DCT transform coefficients, where distribution parameters should be chosen also based on position - which should be included as a part of the context with some properly chosen functions.
As we can see in Fig. 3, there is a large spread of behavior of parameters; using individual models for separate images often gives improvement. It suggests trying to segment the image into regions of similar behavior, or use a region classifier. Having such a segmentation mechanism optimized for a large dataset, with separate models for each segment, they could define default behavior, avoiding the need for separate model estimation and storage. It would be valuable to optimize such segmentation based on the used family of models. An alternative approach is using classifiers and treating their evaluation as part of the context, which would additionally allow to continuously interpolate between classes.
Finally, while for low cost reasons we focused on linear models for parameters, better compression ratios at larger computational cost should be achievable using more general models like neural networks. They are considered in the literature to directly predict discrete probability distributions for pixel values ([5], [6]). We could reduce the computational cost if, based on the context, predicting only parameters of parametric distributions instead, then finally discretizing the obtained distribution. For example Laplace distribution for unimodal distributions, e.g. training a neural network to minimize the sum of |x − µ(c)| and (b(c) − |x − µ(c)|)². For more complex distributions like multimodal ones, we can e.g. parameterize density as a polynomial, and train to minimize the sum of squares of differences for coefficients of orthonormal polynomials as in [9].
APPENDIX
The appendix discusses some general approaches for online adaptation of parameters of models to optimize for local behavior of non-stationary time series, using an updated second order local approximation of the optimized function.

For this purpose we should define a function we want to optimize. This function needs to evolve in time to express the local situation we would like to optimize for. Recently observed values bring us information about this local situation - we can for example use their log-likelihood to estimate the evolution of parameters of the probability distribution, reducing the weights of the old observations to find the current local behavior. It is convenient to use exponentially weakening weights as in an exponential moving average, leading e.g. to the (12) adaptive estimation.
1) Second order online parameter optimization: Let us choose an example of such family of online optimized evolving criteria for a series of observations (x_t) - for time T + 1 as:

$$F^{T+1}(\theta) = \sum_{t\leq T} \eta^{T-t} f(x^t, \theta) = \eta F^T(\theta) + f(x^T, \theta) \quad (24)$$

using some coefficient η ∈ (0, 1) and f a point-wise evaluation, e.g. logarithm of density for log-likelihood. In practice this sum is finite - requiring to choose some initial value for the above recurrence in the exponential moving average.
In machine learning, such objective/cost functions F(θ) are usually minimized, so we can use minus logarithm to optimize log-likelihood, or some approximation of it like the first few terms of the Taylor expansion:

$$f(x, \theta) = -\ln(\rho_\theta(x)) \quad \left(= \sum_{k=1}^{\infty} \frac{(1-\rho_\theta(x))^k}{k}\right)$$
Now for online minimization of F, a natural assumption is that at time T + 1 we know θ^T minimizing F^T, and want to find θ^{T+1} minimizing F^{T+1}. To reduce cost, we would like to slowly evolve parameters (θ^{T+1} ≈ θ^T), which generally requires caution: it might be suboptimal if F is not convex.

To approximate such preferable step θ^{T+1} − θ^T, we can use derivatives calculated with a recurrence as in (24), e.g.:

$$\partial_{\theta_i} F^{T+1} = \eta\, \partial_{\theta_i} F^T + \partial_{\theta_i} f(x^T, \theta)$$
However, to work on values we would need to fix a point (θ*) where such derivatives are taken. We could choose this point as some averaged parameters, and use its perturbed values based on the online calculated first two derivatives at this point, e.g. using Newton's method - multidimensional, or separately for each coordinate i:

$$\forall_i\quad \theta_i^T = \theta_i^* - \partial_{\theta_i} F^T(\theta^*)\, /\, \partial_{\theta_i\theta_i} F^T(\theta^*) \quad (25)$$

For a more general evolution of parameters we need to be able to shift this point of derivation, for example by updating the model at two points simultaneously, using the older one to get the actual parameters (25), and periodically replacing such used point with the new one, starting to build a new model for a recent θ* = θ^T point for derivations.
2) Adaptive minimization with online parabola model: To get a more continuous update of parameters, an alternative approach might be treating (θ_t, ∇_θ f(x_t, θ)|_{θ=θ_t})_{t≤T} as a sequence of (value, noisy gradient) pairs as in stochastic gradient descent. Instead of directly using the second derivative, we can see the neighboring minimum as where the gradient becomes zero - which can be estimated by finding the linear trend of gradients and calculating where this trend crosses zero.
The linear trend of gradients can be calculated in an online way by using least squares linear regression with exponentially weakening weights [14]. Let us present it in the one-dimensional case, e.g. to perform optimization separately for each θ_i parameter, which seems sufficient for a slowly evolving minimum. In the multidimensional case it can alternatively be done using Hessian inversion as in Newton's method - which is more costly, but might be slightly better.
So let us focus on optimization for a 1D parameter θ ∈ R, e.g. to be applied separately to each coordinate of a multidimensional θ. Denote its values in successive times as θ_t, and g_t = ∂_θ f(x^t, θ)|_{θ=θ_t} as the corresponding history of gradients. Analogously to (14), to the (θ_t, g_t)_{t≤T} sequence we would like to fit a parabola f(θ) = h + (λ/2)(θ − p)², optimizing agreement of derivatives g_t ≈ f′(θ_t) for f′(θ) = λ(θ − p), using exponentially weakening weights η^{T−t}:

$$\underset{\lambda, p}{\operatorname{argmin}} \sum_{t\leq T} \eta^{T-t} \left(g_t - \lambda(\theta_t - p)\right)^2 \quad (26)$$
This least squares linear regression leads ([14]) to λ as the (θ, g) covariance divided by the θ variance, and a λ^{−1} learning rate gradient descent for the averaged position and gradient:

$$\lambda = \frac{\overline{g\theta} - \bar{g}\,\bar{\theta}}{\overline{\theta^2} - \bar{\theta}^2} \qquad p = \bar{\theta} - \lambda^{-1}\bar{g} \quad (27)$$
where the bars denote exponential moving averages (η ∈ (0, 1)):

$$\bar{\theta}^{T+1} = \eta\,\bar{\theta}^T + (1-\eta)\,\theta^T \qquad \bar{g}^{T+1} = \eta\,\bar{g}^T + (1-\eta)\,g^T$$
$$\overline{g\theta}^{T+1} = \eta\,\overline{g\theta}^T + (1-\eta)\,g^T\theta^T \qquad \overline{\theta^2}^{T+1} = \eta\,\overline{\theta^2}^T + (1-\eta)\,(\theta^T)^2$$
The found p = \bar{\theta} - \lambda^{-1}\bar{g} is the modeled minimum if λ > 0. Seeing it as gradient descent (using averaged gradient and position), we can e.g. use absolute value and clipping with some ε > 0:

$$p = \bar{\theta} - \frac{\bar{g}}{\max(|\lambda|, \epsilon)}$$

to handle also negative λ and λ ≈ 0 situations (e.g. near an inflection point). Finally we can use e.g.

$$\theta^{T+1} = \alpha p + (1-\alpha)\,\theta^T$$

as the parameter evolution step, for some α ∈ (0, 1] describing trust in the parabola model.
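A sketch of this 1D update (pure Python; s holds the four moving averages, initialized e.g. at zeros):

```python
def parabola_step(theta, g, s, eta=0.9, eps=1e-3, alpha=0.5):
    """One update of the online parabola model (26)-(27) for a single
    parameter theta with noisy gradient g."""
    th, gr, gth, th2 = s
    th  = eta * th  + (1 - eta) * theta
    gr  = eta * gr  + (1 - eta) * g
    gth = eta * gth + (1 - eta) * g * theta
    th2 = eta * th2 + (1 - eta) * theta * theta
    lam = (gth - gr * th) / max(th2 - th * th, 1e-12)  # trend slope (27)
    p = th - gr / max(abs(lam), eps)                   # clipped minimum
    return alpha * p + (1 - alpha) * theta, (th, gr, gth, th2)
```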
While online optimization of parameters can be determined by minimization of an evolving criterion (24) like log-likelihood with exponentially weakening weights, there is still a required parameter (e.g. η) of the exponential moving average - defining how conservative the model should be. It is usually fixed in data compression (adaptivity) or machine learning (e.g. SGD optimizer), as in a wide range its optimization can only give a relatively tiny improvement.
So in practice such a parameter like η can be just fixed, e.g. optimized over a larger set of data. We could also try to slowly adapt it to local conditions if controlling also the ∂θ^T/∂η dependence for the used parameters. It would allow to calculate ∂f(x^T, θ^T)/∂η:

$$\frac{\partial f(x^T, \theta^T)}{\partial \eta}(\eta^T) = \frac{\partial \theta^T}{\partial \eta}(\eta^T) \cdot \nabla_\theta f(x^T, \theta)\Big|_{\theta=\theta^T}$$

Treating them as noisy derivatives again, we can use gradient methods, e.g. after some averaging, or find their linear trend with online linear regression as above.
Generally, one could also try to extrapolate e.g. future behavior of parameters based on the recent history; however, this requires extreme caution.
REFERENCES

[1] M. J. Weinberger, G. Seroussi, and G. Sapiro, "The LOCO-I lossless image compression algorithm: Principles and standardization into JPEG-LS," IEEE Transactions on Image Processing, vol. 9, no. 8, pp. 1309–1324, 2000.
[2] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, no. 5, pp. 465–471, 1978.
[3] J. Sneyers and P. Wuille, "FLIF: Free lossless image format based on MANIAC compression," in 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 66–70.
[4] T. C. Mills, Time Series Techniques for Economists. Cambridge University Press, 1991.
[5] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," arXiv preprint arXiv:1601.06759, 2016.
[6] J. Menick and N. Kalchbrenner, "Generating high fidelity images with subscale pixel networks and multidimensional upscaling," arXiv preprint arXiv:1812.01608, 2018.
[7] R. Koenker and K. F. Hallock, "Quantile regression," Journal of Economic Perspectives, vol. 15, no. 4, pp. 143–156, 2001.
[8] J. Duda, "Exploiting statistical dependencies of time series with hierarchical correlation reconstruction," arXiv preprint arXiv:1807.04119, 2018.
[9] J. Duda and A. Szulc, "Credibility evaluation of income data with hierarchical correlation reconstruction," arXiv preprint arXiv:1812.08040, 2018.
[10] P. R. Tadikamalla, "Random sampling from the exponential power distribution," Journal of the American Statistical Association, vol. 75, no. 371, pp. 683–686, 1980.
[11] N. Fout and K.-L. Ma, "An adaptive prediction-based approach to lossless compression of floating-point volume data," IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 12, pp. 2295–2304, 2012.
[12] A. Haar, "Zur Theorie der orthogonalen Funktionensysteme," Mathematische Annalen, vol. 69, no. 3, pp. 331–371, 1910.
[13] J. Duda, "Fractal wavelets," 2014. [Online]. Available: https://github.com/JarekDuda/FractalWavelets
[14] ——, "SGD momentum optimizer with step estimation by online parabola model," arXiv preprint arXiv:1907.07063, 2019.