A Numerical Method for the Evaluation of Kolmogorov Complexity, An alternative to lossless compression algorithms

Jan 12, 2015

Hector Zenil

We present a novel alternative method (other than using compression algorithms) to approximate the algorithmic complexity of a string by calculating its algorithmic probability and applying Chaitin-Levin's coding theorem.
Page 1: A Numerical Method for the Evaluation of Kolmogorov Complexity, An alternative to lossless compression algorithms

A Numerical Method for the Evaluation of Kolmogorov Complexity

Hector Zenil

Amphithéâtre Alan M. Turing
Laboratoire d’Informatique Fondamentale de Lille

(UMR CNRS 8022)

Hector Zenil (LIFL) A Numerical Method for the Evaluation of Kolmogorov Complexity 1 / 39

Page 2

Foundational Axis

As pointed out by Greg Chaitin (in his report on H. Zenil’s thesis):

The theory of algorithmic complexity is of course now widely accepted, but was initially rejected by many because of the fact that algorithmic complexity is on the one hand uncomputable and on the other hand dependent on the choice of universal Turing machine.

This last drawback is especially restrictive for real-world applications because the dependency is especially strong for short strings, and a solution to this problem is at the core of this work.


Page 3

Foundational Axis (cont.)

The foundational point of departure of the thesis is an apparent contradiction, pointed out by Greg Chaitin (same thesis report):

... the fact that algorithmic complexity is extremely, dare I say violently, uncomputable, but nevertheless often irresistible to apply ...


Page 4

Algorithmic Complexity

Foundational Notion

A string is random if it is hard to describe.
A string is not random if it is easy to describe.

Main Idea

The theory of computation replaces descriptions with programs. It constitutes the framework of algorithmic complexity:

description ⇐⇒ computer program


Page 5

Algorithmic Complexity (cont.)

Definition

[Kolmogorov(1965), Chaitin(1966)]

K(s) = min{|p| : U(p) = s}

The algorithmic complexity K(s) of a string s is the length of the shortest program p that produces s when run on a universal Turing machine U.

The formula conveys the following idea: a string with low algorithmic complexity is highly compressible, as the information it contains can be encoded in a program much shorter than the string itself.
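In practice, this compressibility reading of K(s) is exactly what lossless-compression approximations exploit: the compressed size of s is an upper bound (up to format overhead) on a description length for s. A minimal sketch in Python, using zlib purely as an illustrative stand-in for a shortest program:

```python
import random
import zlib

def compressed_size(s: bytes) -> int:
    # Size of the zlib-compressed form of s: an upper-bound proxy
    # (up to the compressor's format overhead) for K(s).
    return len(zlib.compress(s, 9))

structured = b"01" * 500                  # low K: "500 times 01"
rng = random.Random(0)                    # fixed seed for reproducibility
random_like = bytes(rng.getrandbits(8) for _ in range(1000))

# The regular string compresses far below its length;
# the pseudo-random one barely compresses at all.
```

Note that compression can only ever certify non-randomness (an upper bound on K); it can never certify that a string is random, in line with the noncomputability results below.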


Page 6

Algorithmic Randomness

Example

The string 010101010101010101 has low algorithmic complexity because it can be described as 9 times 01, and no matter how long it grows, if the pattern repeats, the description (k times 01) grows only by about log(k), remaining much shorter than the string itself.

Example

The string 010010110110001010 has high algorithmic complexity because it doesn’t seem to allow a (much) shorter description than the string itself, so a shorter description may not exist.


Page 7

Example of an evaluation of K

The string 01010101...01 can be produced by the following program:

Program A:
1: n := 0
2: Print n
3: n := n + 1 mod 2
4: Goto 2

The length of A (in bits) is an upper bound of K(010101...01).

Connections to predictability: program A trivially allows a shortcut to the value of an arbitrary digit through the following function f(n):

if n = 2m then f(n) = 1, f(n) = 0 otherwise.
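The shortcut f(n) can be written out directly (a trivial sketch; here n is taken as the 1-indexed position in the printed sequence 0101...):

```python
def f(n: int) -> int:
    # Digit at position n of the sequence 0101...: positions of the
    # form n = 2m (even positions) hold a 1, odd positions hold a 0.
    return 1 if n % 2 == 0 else 0
```

So the n-th digit is predictable in constant time, without running program A for n steps, which is the sense of the simple/predictable correspondence below.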

Predictability characterization (Schnorr) [Downey(2010)]

simple ⇐⇒ predictable
random ⇐⇒ unpredictable


Page 8

Noncomputability of K

The main drawback of K is that it is not computable and thus can only be approximated in practice.

Important

No algorithm can tell whether a program p generating s is the shortest (due to the undecidability of the halting problem for Turing machines).

No absolute notion of randomness

It is impossible to prove that a program p generating s is the shortest possible, which also implies that if a program is about the length of the original string, one cannot tell whether a shorter program producing s exists. Hence, there is no way to declare a string truly algorithmically random.


Page 9

Structure vs. randomness

Formal notion of structure

One can, however, exhibit a program generating s that is (much) shorter than s itself. So even though one cannot tell whether a string is random, one can declare s not random if a program generating s is (much) shorter than the length of s.

As a result, one can only find upper bounds of K: s cannot be more complex than the length of the shortest known program producing s.


Page 10

Most strings have maximal algorithmic complexity

Even if one cannot tell when a string is truly random, it is known that most strings cannot have much shorter generating programs, by a simple combinatorial argument:

There are exactly 2^n bit strings of length n,

But there are only 2^0 + 2^1 + 2^2 + ... + 2^(n−1) = 2^n − 1 bit strings of fewer bits. (In fact, there is at least one n-bit string that cannot be compressed even by a single bit.)

Hence, there are considerably fewer short programs than long strings.

Basic notion

One can’t pair up all n-length strings with programs of much shorter length (there simply aren’t enough short programs to encode all longer strings).
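The counting identity behind this argument can be checked directly:

```python
n = 16
strings_of_length_n = 2 ** n
strictly_shorter = sum(2 ** k for k in range(n))   # lengths 0 .. n-1

# There is always one more n-bit string than there are strictly
# shorter candidate descriptions, so by pigeonhole some n-bit
# string has no shorter description at all.
assert strictly_shorter == strings_of_length_n - 1
```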


Page 11

The choice of U matters

A major criticism brought against K is its dependence on the universal Turing machine U. From the definition:

K(s) = min{|p| : U(p) = s}

It may turn out that:

KU1(s) ≠ KU2(s) when evaluated using U1 and U2 respectively.

Basic notion

This dependency is particularly troubling for short strings, shorter for example than the length of the universal Turing machine on which K of the string is evaluated (typically on the order of hundreds of bits, as originally suggested by Kolmogorov himself).


Page 12

The Invariance theorem

A theorem guarantees that, in the long term, different algorithmic complexity evaluations converge to the same values as the length of the strings grows.

Theorem

Invariance theorem: If U1 and U2 are two (universal) Turing machines and KU1(s) and KU2(s) the algorithmic complexity of a binary string s when U1 or U2 is used respectively, there exists a constant c such that for all binary strings s:

|KU1(s) − KU2(s)| < c

(think of a compiler between two programming languages)

Yet the additive constant can be arbitrarily large, making it unstable (if not impossible) to evaluate K(s) for short strings.


Page 13

Theoretical holes

1 Finding a stable framework for calculating the complexity of short strings (one wants short strings like 000...0 to always be among the least algorithmically random, despite any choice of machine).

2 Pathological cases: theory says that a single bit has maximal algorithmic complexity, because the greatest possible compression is evidently the bit itself (paradoxically, it is the only finite string for which one can be sure it cannot be compressed further), yet one would intuitively say that a single bit is among the simplest strings.

We try to fill these holes by introducing the concept of algorithmic probability as an alternative tool for evaluating K(s).


Page 14

Algorithmic Probability

There is a measure that describes the expected output of a random program running on a universal Turing machine.

Definition

[Levin(1977)]
m(s) = Σ_{p : U(p)=s} 1/2^|p|, i.e. the sum over all programs p for which U (a prefix-free universal Turing machine) outputs the string s and halts.

m is traditionally called Levin’s semi-measure, the Solomonoff-Levin semi-measure, or the Universal Distribution [Kirchherr and Li(1997)].


Page 15

The motivation for Solomonoff-Levin’s m(s)

Borel’s typewriting monkey metaphor¹ is useful to explain the intuition behind m(s):

If you were to produce the digits of a mathematical constant like π by throwing digits at random, you would have to produce every digit of its infinite irrational decimal expansion.

If you place a monkey at a typewriter (with, say, 50 keys), the probability of the monkey typing an initial segment of 2400 digits of π by chance is 1/50^2400.

¹ Émile Borel (1913) “Mécanique Statistique et Irréversibilité” and (1914) “Le hasard”.


Page 16

The motivation for Solomonoff-Levin’s m(s) (cont.)

But if instead the monkey is placed at a computer, the chances of producing a program generating the digits of π are only 1/50^158, because it would take the monkey only 158 characters to produce the first 2400 digits of π using, for example, this C language code:

int a = 10000, b, c = 8400, d, e, f[8401], g;
main() {
    for (; b - c;) f[b++] = a / 5;
    for (; d = 0, g = c * 2; c -= 14, printf("%.4d", e + d / a), e = d % a)
        for (b = c; d += f[b] * a, f[b] = d % --g, d /= g--, --b; d *= b);
}

Implementations, in any programming language, of any of the many known formulae for π are shorter than the expansion of π, and therefore have a greater chance of being produced at random than the digits of π one by one.


Page 17

More formally said

Randomly picking a binary string s of length k among all (uniformly distributed) strings of the same length has probability 1/2^k.

But the probability of finding a binary program p producing s (upon halting), among binary programs running on a Turing machine U, is at least 1/2^|p|, such that U(p) = s (we know that such a program exists because U is a universal Turing machine).

Because |p| ≤ k (e.g. the example for π described before), a string s with a short generating program has a greater chance of being produced by p than by writing down all k bits of s one by one.

The less random a string, the more likely it is to be produced by a short program.


Page 18

Towards a semi-measure

However, there is an infinite number of programs producing s, so the probability of picking a program producing s among all possible programs is Σ_{U(p)=s} 1/2^|p|, the sum over all programs producing s running on the universal Turing machine U.

Nevertheless, for a measure to be a probability measure, the sum over all possible events should add up to 1. So Σ_{U(p)=s} 1/2^|p| cannot be a probability measure, given that an infinite number of programs contribute to the overall sum. For example, the following two programs, 1 and 2, both produce the string 0.

Program 1:
1: Print 0

and:

Program 2:
1: Print 0
2: Print 1
3: Erase the previous 1

and there are (countably) infinitely many more.


Page 19

Towards a semi-measure (cont.)

So for m(s) to be a probability measure, the universal Turing machine U has to be a prefix-free Turing machine, that is, a machine that does not accept as a valid program one that has another valid program at its beginning; e.g. program 2 starts with program 1, so if program 1 is a valid program then program 2 cannot be one.

The set of valid programs is said to form a prefix-free set, that is, no element is a prefix of any other, a property necessary to keep 0 < m(s) < 1. For more details see Kraft’s inequality [Calude(2002)].
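Kraft’s inequality is easy to check on a small prefix-free set (a toy example; these codewords are illustrative and not drawn from the slides):

```python
codewords = ["0", "10", "110", "111"]   # no codeword is a prefix of another

# Prefix-freeness check: no element starts with a different element.
assert all(not b.startswith(a)
           for a in codewords for b in codewords if a != b)

# Kraft's inequality: the implied program probabilities 2^-|w|
# sum to at most 1, which is what keeps m(s) bounded by 1.
assert sum(2 ** -len(w) for w in codewords) <= 1
```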

However, some programs halt and others don’t (actually, most do not halt), so one can only run U and see which programs produce s, contributing to the sum. It is said then that m(s) is semi-computable from below, and it is therefore considered a probability semi-measure (as opposed to a full measure).


Page 20

Some properties of m(s)

Solomonoff and Levin proved that, in the absence of any other information, m(s) dominates any other semi-measure and is therefore optimal in this sense (hence also its adjective “universal”).

On the other hand, the greatest contributor to the sum Σ_{U(p)=s} 1/2^|p| is the shortest program p, since that is where the denominator 2^|p| reaches its smallest value and therefore 1/2^|p| its greatest value. The length of the shortest program p producing s is nothing but K(s), the algorithmic complexity of s.


Page 21

The coding theorem

The greatest contributor to the sum Σ_{U(p)=s} 1/2^|p| is the shortest program p, since that is where the denominator 2^|p| reaches its smallest value and therefore 1/2^|p| its greatest value. The length of the shortest program p producing s is nothing but K(s), the algorithmic complexity of s. The coding theorem [Levin(1977), Calude(2002)] describes this connection between m(s) and K(s):

Theorem

K(s) = −log2(m(s)) + c

Notice that the coding theorem reintroduces an additive constant! One may not get rid of it, but the choices related to m(s) are much less arbitrary than picking a universal Turing machine directly for K(s).
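Plugging output frequencies into the coding theorem (and dropping the constant c) converts a distribution into complexity estimates in bits; a sketch using a few of the D(2) frequencies reported in these slides:

```python
import math

# A few frequencies from the D(2) table of halting (2,2) machines
d2 = {"0": 0.328, "1": 0.328, "00": 0.0834, "001": 0.00098}

# Coding theorem, ignoring the additive constant: K(s) ~ -log2(m(s))
approx_K = {s: -math.log2(p) for s, p in d2.items()}
```

This gives roughly 1.6 bits for “0” and about 10 bits for “001”: it is the ranking, rather than the absolute values, that the method provides.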


Page 22

An additive constant in exchange for a massive computation

The trade-off, however, is that the calculation of m(s) requires an extraordinary amount of computation.

As pointed out by J.-P. Delahaye concerning our method (Pour La Science, No. 405, July 2011 issue):

Like very small durations or lengths, low complexities are delicate to evaluate. Paradoxically, the evaluation methods demand colossal computations.

The first description of our approach was published in Greg Chaitin’s festschrift volume for his 60th birthday: J-P. Delahaye & H. Zenil, “On the Kolmogorov-Chaitin complexity for short sequences,” in C.S. Calude (ed.), Randomness and Complexity: From Leibniz to Chaitin, World Scientific, 2007.


Page 23

Calculating an experimental m

Main idea

To evaluate K(s) one can calculate m(s). m(s) is more stable than K(s) because one makes fewer arbitrary choices of a Turing machine U.

Definition

D(n) = the function that assigns to every finite binary string s the quotient:
(# of times that a machine in (n,2) produces s) / (# of machines in (n,2)).

D(n) is the probability distribution of the strings produced by all n-state 2-symbol Turing machines (denoted by (n,2)).

Examples for n = 1, n = 2 (normalized by the # of machines that halt)

D(1) = 0 → 0.5; 1 → 0.5
D(2) = 0 → 0.328; 1 → 0.328; 00 → 0.0834; . . .
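This computation can be sketched by brute force for tiny n. The machine encoding below (states 1..n plus a halt state 0, blank-0 tape, output taken as the scanned tape portion on halting, and a fixed step cap in place of the exact Busy Beaver bound) is an assumption of this sketch, not necessarily the thesis’s exact formalism, so values for n ≥ 2 may differ in detail from the slides:

```python
from collections import Counter
from itertools import product

def run(machine, max_steps):
    """Run an (n,2) machine on a blank (all-0) tape. Return the scanned
    tape portion as a string, or None if it does not halt in time."""
    tape, pos, state, scanned = {}, 0, 1, set()
    for _ in range(max_steps):
        scanned.add(pos)
        write, move, nxt = machine[(state, tape.get(pos, 0))]
        tape[pos] = write
        pos += move
        state = nxt
        if state == 0:  # halt state reached
            lo, hi = min(scanned), max(scanned)
            return "".join(str(tape.get(i, 0)) for i in range(lo, hi + 1))
    return None

def D(n, max_steps):
    """Empirical output distribution over all halting (n,2) machines."""
    keys = [(s, b) for s in range(1, n + 1) for b in (0, 1)]
    # Each transition writes 0/1, moves left/right, and picks a next
    # state (0 = halt): 4(n+1) choices per (state, symbol) pair.
    entries = list(product((0, 1), (-1, 1), range(n + 1)))
    counts = Counter()
    for rules in product(entries, repeat=len(keys)):
        out = run(dict(zip(keys, rules)), max_steps)
        if out is not None:
            counts[out] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

d1 = D(1, max_steps=4)   # reproduces the slide's D(1): 0 -> 0.5, 1 -> 0.5
```

Under this encoding (2,2) already has 20 736 machines, and the space grows as (4(n+1))^(2n); the authors’ own enumeration of (4,2), which differs in detail, comprises the 22 039 921 152 machines reported later.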


Page 24

Calculating an experimental m (cont.)

Definition

[T. Rado(1962)]
A busy beaver is an n-state, 2-color Turing machine that writes a maximum number of 1s, or performs a maximum number of steps, before halting when started on an initially blank tape.

Given that the Busy Beaver function values are known for n-state 2-symbol Turing machines for n = 2, 3, 4, we could compute D(n) for n = 2, 3, 4.

We ran all 22 039 921 152 two-way-tape Turing machines, starting with a tape filled with 0s and 1s, in order to calculate D(4).²

Theorem

D(n) is noncomputable (by reduction to Rado’s Busy Beaver problem).

² A 9-day calculation on a single 2.26 GHz Intel Core Duo CPU.

Page 25

Complexity Tables

Table: The 22 bit strings in D(2) from the 6 088 (2,2)-Turing machines that halt. [Delahaye and Zenil(2011)]

0 → .328      010 → .00065
1 → .328      101 → .00065
00 → .0834    111 → .00065
01 → .0834    0000 → .00032
10 → .0834    0010 → .00032
11 → .0834    0100 → .00032
001 → .00098  0110 → .00032
011 → .00098  1001 → .00032
100 → .00098  1011 → .00032
110 → .00098  1101 → .00032
000 → .00065  1111 → .00032

Solving degenerate cases

“0” is the simplest string (together with “1”) according to D.


Page 26

Partial D(4) (top strings)


Page 27

From a Prior to an Empirical Distribution

We see algorithmic complexity emerging:

1 The classification accords with our intuition of what complexity should be.

2 Strings are almost always classified by length, except in cases where intuition justifies that they should not be. For example, even though 0101010 is of length 7, it was ranked higher than some strings shorter than 7. One sees the low algorithmic complexity of 010101... emerging, marking it as a simple string.

From m to D

Unlike m, D is an empirical distribution and no longer a prior. D experimentally confirms the intuition behind Solomonoff and Levin’s measure.

Full tables are available online: www.algorithmicnature.org


Page 28

Miscellaneous facts from D(3) and D(4)

There are 5 970 768 960 machines that halt among the 22 039 921 152 in (4,2); that is, a fraction of 0.27 halt.

Among the least random looking strings from D(4) are: 0, 00, 000..., 01, 010, 0101, etc.

Among the most random looking strings one can find: 1101010101010101, 1101010100010101, 1010101010101011 and 1010100010101011, each with a frequency of 5.4447 × 10^−10.

As in D(3), where we reported that one string group (0101010 and its reversal) climbed positions, in D(4) 399 strings climbed to the top and were not sorted among their length groups.

In D(4), string length was no longer a classification determinant. For example, between positions 780 and 790, string lengths are: 11, 10, 10, 11, 9, 10, 9, 9, 9, 10 and 9 bits.

D(4) preserves the string order of D(3), except in 17 places out of the 128 strings of D(3) ordered from highest to lowest frequency.


Page 29

Connecting D back to m

To get m we replaced the uniform distribution over the bits composing strings with a uniform distribution over the bits composing programs. Imagine that your (Turing-complete) programming language allows a monkey to produce Turing machine rules at random; every time the monkey types a valid program, it is executed.

In the limit, the monkey (which is just a random source of programs) will end up covering a sample of the space of all possible Turing machine rules.


Page 30

Connecting D back to m

On the other hand, D(n) for a fixed n is the result of running all n-state 2-symbol Turing machines according to an enumeration.

An enumeration is just a thorough sample of the space of all n-state 2-symbol Turing machines, each with fixed probability 1/(# of Turing machines in (n,2)) (by definition of enumeration).

D(n) is therefore a legitimate programmer-monkey experiment. The additional advantage of performing a thorough sample of Turing machines by following an enumeration is that the order in which the machines are traversed is irrelevant, as long as one covers all the elements of the (n,2) space.


Page 31

Connecting D back to m (cont.)

One may ask why shorter programs are favored.

The answer, in analogy to the monkey experiment, is based on the uniform random distribution of keystrokes: programs cannot be that long without eventually containing the end-of-program keystroke. One can still think of imposing a different distribution on the program instructions, for example by changing the keyboard distribution, repeating certain keys.

Choices other than the uniform distribution are more arbitrary than just assuming no additional information: a keyboard with two or more “a” keys rather than the usual one seems more arbitrary than having one key per letter.


Page 32

Connecting D back to m (cont.)

Every D(n) is a sample of D(n + 1), because (n + 1, 2) contains all machines in (n, 2). We have empirically verified that strings sorted by frequency in D(4) preserve the order of D(3), which in turn preserves the order of D(2), meaning that longer programs do not produce completely different classifications. One can think of the sequence D(1), D(2), D(3), D(4), . . . as samples whose values are approximations to m.

One may also ask how we can know whether a monkey provided with a different programming language would produce a completely different D, and therefore yet another experimental version of m. That may be the case, but we have also shown that reasonable programming languages (e.g. based on cellular automata and Post tag systems) produce reasonable (correlated) distributions.


Page 33

Connecting D back to m (cont.)


Page 34

m(s) provides a formalization for Occam’s razor

The immediate consequence of algorithmic probability is simple but powerful (and surprising):

Basic notion

Typewriting monkeys (Borel):
garbage in → garbage out

Programmer monkeys (Bennett, Chaitin):
garbage in → structure out


Page 35

What may m(s) tell us about the physical world?

Basic notion

m(s) tells us that it is unlikely that a Rube Goldberg machine produced a string if the string can be produced by a much simpler process.

Physical hypothesis

m(s) would tell us that, if processes in the world are computer-like, it is unlikely that structures are the result of the computation of a Rube Goldberg machine. Instead, they would be the result of the shortest programs producing those structures, and patterns would follow the distribution suggested by m(s).


Page 36

On the algorithmic nature of the world

Could it be that m(s) tells us how structure in the world has come to be and how it is distributed all around? Could m(s) reveal the machinery behind it?

What happens in the world is often the result of an ongoing (mechanical) process (e.g. the Sun rising due to the mechanical celestial dynamics of the solar system).

Can m(s) tell us something about the distribution of patterns in the world? To find out, we took some empirical datasets from the physical world and compared them against data produced by pure computation, which by definition should follow m(s).

The results were published in H. Zenil & J-P. Delahaye, “On the Algorithmic Nature of the World,” in G. Dodig-Crnkovic and M. Burgin (eds), Information and Computation, World Scientific, 2010.


Page 37

On the algorithmic nature of the world


Page 38

Conclusions

Our method aimed to show that reasonable choices of formalisms for evaluating the complexity of short strings through m(s) give consistent measures of algorithmic complexity.

[Greg Chaitin (w.r.t. our method)] ...the dreaded theoretical hole in the foundations of algorithmic complexity turns out, in practice, not to be as serious as was previously assumed.

Our method also seems notable in that it is an experimental approach that comes to the rescue of the apparent holes left by the theory.


Page 39

Bibliography

C.S. Calude, Information and Randomness: An Algorithmic Perspective (Texts in Theoretical Computer Science, An EATCS Series), Springer, 2nd edition, 2002.

G.J. Chaitin, On the length of programs for computing finite binary sequences, Journal of the ACM, 13(4):547–569, 1966.

G. Chaitin, Meta Math!, Pantheon, 2005.

R.G. Downey and D. Hirschfeldt, Algorithmic Randomness and Complexity, Springer Verlag, 2010.

J.P. Delahaye and H. Zenil, On the Kolmogorov-Chaitin complexity for short sequences, in C.S. Calude (ed.), Randomness and Complexity: From Leibniz to Chaitin, World Scientific, 2007.

J.P. Delahaye and H. Zenil, Numerical Evaluation of Algorithmic Complexity for Short Strings: A Glance into the Innermost Structure of Randomness, arXiv:1101.4795v4 [cs.IT].


Page 40

C.S. Calude and M.A. Stay, Most Programs Stop Quickly or Never Halt, 2007.

W. Kirchherr and M. Li, The miraculous universal distribution, Mathematical Intelligencer, 1997.

A.N. Kolmogorov, Three approaches to the quantitative definition of information, Problems of Information Transmission, 1(1):1–7, 1965.

P. Martin-Löf, The definition of random sequences, Information and Control, 9:602–619, 1966.

L. Levin, On a concrete method of assigning complexity measures, Doklady Akademii Nauk SSSR, 18(3):727–731, 1977.

L. Levin, Universal Search Problems, 9(3):265–266, 1973 (submitted 1972, reported in talks 1971). English translation in: B.A. Trakhtenbrot, A Survey of Russian Approaches to Perebor (Brute-force Search) Algorithms, Annals of the History of Computing, 6(4):384–400, 1984.


Page 41

M. Li and P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, Springer, 3rd revised edition, 2008.

S. Lloyd, Programming the Universe: A Quantum Computer Scientist Takes On the Cosmos, Knopf Publishing Group, 2006.

T. Rado, On non-computable functions, Bell System Technical Journal, 41(3), 1962.

R.J. Solomonoff, A formal theory of inductive inference: Parts 1 and 2, Information and Control, 7:1–22 and 224–254, 1964.

H. Zenil and J.P. Delahaye, On the Algorithmic Nature of the World, in G. Dodig-Crnkovic and M. Burgin (eds), Information and Computation, World Scientific, 2010.

S. Wolfram, A New Kind of Science, Wolfram Media, 2002.
