
The Fifth International Conference on Neural Networks and Artificial Intelligence, May 27-30, Minsk, Belarus

The Estimations Based on the Kolmogorov Complexity and Machine Learning from Examples

Vladimir I. Donskoy

Taurian National University, 4, Vernadsky Avenue, Simferopol, 95007, Ukraine, [email protected]

Abstract - In this paper, the interrelation between the Kolmogorov complexity and the VCD of classes of partial recursive functions used in machine learning from examples is investigated. A novel method, pVCD, for estimating both the VCD and the Kolmogorov complexity by means of programming is proposed. It is shown how Kolmogorov complexity can be used to substantiate the significance of regularities discovered in training samples.

Keywords - Kolmogorov complexity, VCD, Machine Learning, Samples.

I. INTRODUCTION

When examining problems of machine learning, it is natural to limit the class of decision functions used to the partial recursive functions. In that case, we need to take an algorithmic approach to machine learning and to examine the algorithmic complexity of models. The statistical Vapnik-Chervonenkis theory of learning [1], the Kolmogorov approach [2], MDL [3], and the various heuristics used in machine learning are all based on notions of the complexity of the models used to find regularities or decision-making rules. From these different points of view, when learning from examples is used, it is expedient to choose the simplest possible decision rule (model). The nature of the arising problem can be seen as a decree of Nature: a regularity almost always has to be very simple or has a very simple description; in other words, low complexity.

In this paper, Vapnik-Chervonenkis theory is extensively used. This theory begins with the concepts of the shatter coefficient and the Vapnik-Chervonenkis dimension (VCD) [4]. Let $\bar{x} = (x_1, \dots, x_l)$ be a sample; $x_i \in X$, $i = 1, \dots, l$; $X$ is a set which is defined by the application. $\Delta^S(x_1, \dots, x_l)$ is the number of distinct partitions of the sample $\bar{x}$ into two classes which can be realized by the rules (functions) of the family $S = \{A \mid A: X \to \{0,1\}\}$. It is evident that $\Delta^S(x_1, \dots, x_l) \le 2^l$. The function

$$m^S(l) = \max_{x_1, \dots, x_l \in X} \Delta^S(x_1, \dots, x_l)$$

is called the growth function of the family $S$, or the $l$-th shatter coefficient of $S$ [4]. The set of all possible samples of length $l$ is denoted $X^l$. The growth function either is identically equal to $2^l$, or is majorized by the function $l^h + 1$, where $h$ is the minimum value of $l$ at which $m^S(l) \ne 2^l$. The following definition is based on this estimation: if there exists an $h$ such that $m^S(l) \le l^h + 1$ for any $l$, then it is said that the family $S$ has finite capacity $h$ (or $VCD(S) = h$). If $m^S(l) \equiv 2^l$, then it is said that the VCD is infinite: $VCD(S) = \infty$. If $\mathrm{card}(S) = N < \infty$, then $m^S(l) \le N$ and $VCD(S) \le \log_2 N$. The main result of the statistical Vapnik-Chervonenkis theory is: the finiteness of $VCD(S)$ guarantees learnability by the method of empirical induction when the classification rule is chosen from the family $S$. The fundamental inequality

$$P\Big\{ \sup_{A \in S} \big| \nu(A) - P(A) \big| > \varepsilon \Big\} \le 4\, m^S(2l)\, e^{-\varepsilon^2 l / 8}$$

is used to estimate the length of a sample which is necessary to guarantee that the empirical error (frequency ratio) $\nu(A)$ of the learned classification rule will be $\varepsilon$-close to the unknown probability $P(A)$ of the error of this rule.
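For instance, the smallest sample length $l$ satisfying this bound for a given accuracy $\varepsilon$ and confidence level $\delta$ can be found numerically. A minimal Python sketch, assuming a family of finite capacity $h$ so that $m^S(2l) \le (2l)^h + 1$:

```python
import math

def required_sample_length(h, eps, delta, l_max=10**7):
    """Smallest l with 4 * m_S(2l) * exp(-eps**2 * l / 8) <= delta,
    using the majorant m_S(2l) <= (2l)**h + 1 for a family of capacity h."""
    l = 1
    while l <= l_max:
        bound = 4 * ((2 * l) ** h + 1) * math.exp(-eps ** 2 * l / 8)
        if bound <= delta:
            return l
        l += 1
    return None  # not reached within l_max

# Example: capacity h = 10, accuracy eps = 0.1, confidence delta = 0.05
print(required_sample_length(10, 0.1, 0.05))
```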

The main purpose of this paper is to analyze the process of machine learning from examples when recursive function families are used. To achieve this purpose, we define the Kolmogorov complexity $K_l(S)$ [5] of a family of recursive functions and prove the inequality $VCD(S) \le K_l(S)$. A novel method of estimating both $VCD(S)$ and $K_l(S)$ is proposed. Finally, a majorant is obtained for the probability of the random choice of a recursive rule $A$ which is absolutely correct on all examples of a sample of length $l$, when


this rule is found by means of machine learning. The results obtained in this paper are based on the Kolmogorov approach, which treats nonrandomness as regularity.

II. KOLMOGOROV COMPLEXITY OF THE RECURSIVE CLASSIFIERS

Let $S = \{A\}$ be a family of general recursive functions (of algorithms) of the form $A: X \to \{0,1\}$. A training sample, which is denoted as $\bar{x} = (x_1, \dots, x_l)$, contains arbitrary elements from $X$. This sample presents an ordered collection which consists of bounded natural numbers. The bounded set of all these samples is denoted as $X^l$, and it requires $\lceil \log_2 \mathrm{card}(X^l) \rceil$ bits to present $\mathrm{card}(X^l)$ states. The set $\{0,1\}^*$ of 0-1-strings (words) of arbitrary length, as usual, presents the numbers 0, 1, 2, …. The length of a string $\alpha$ is denoted $l(\alpha)$. $F$ is the class of partial recursive functions. We define more exactly the training sequence, or the sample, as the pairs

$$\bar{t} = \big( (x_1, \tilde{A}(x_1)), \dots, (x_l, \tilde{A}(x_l)) \big),$$

where $x_j \in X$, $j = 1, \dots, l$; $\tilde{A}: X \to \{0,1\}$ is some a priori unknown, but existing, classification function. The set of all possible training samples is denoted as $T_l$. This set is the general population from which samples can be extracted. The machine learning problem consists in finding the unknown function $\tilde{A}$ by using the given sample $\bar{t}$. Practically, the result of machine learning is a function $A^*$ which is not equal to $\tilde{A}$, but which is, in a certain sense, as close as possible to $\tilde{A}$. The family $S$ is defined by the choice of the model of machine learning (and by the corresponding family of classification algorithms): for example, by decision trees, neural networks, potential functions, and other heuristics. The most intricate problem is the determination of a family $S$ which is relevant, adequate to the initial information $\bar{t}$; this is why empirical learning problems are so complicated.

Definition 1.

1º The complexity of the algorithm $A$ relative to the sample $\bar{x}$ by the partial recursive function $\varphi$ is

$$K_\varphi(A \mid \bar{x}) = \min\{\, l(p) : \varphi(p, \bar{x}) = \alpha_A(\bar{x}) \,\},$$

where $\alpha_A(\bar{x}) = (A(x_1), \dots, A(x_l))$ is a binary word of the length $l$.

2º The complexity of the algorithm $A$ at the set $X^l$ by the partial recursive function $\varphi$ is

$$K_\varphi(A \mid X^l) = \max_{\bar{x} \in X^l} K_\varphi(A \mid \bar{x}).$$

3º The complexity of the family $S$ of algorithms at the set $X^l$ by the partial recursive function $\varphi$ is

$$K_\varphi(S \mid X^l) = \max_{A \in S} K_\varphi(A \mid X^l).$$

4º The complexity of the family $S$ of algorithms at the set $X^l$ is

$$K_l(S) = \min_{\varphi \in F} K_\varphi(S \mid X^l).$$
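The minimization over programs in item 1º can be made concrete with a brute-force toy. The external algorithm $\varphi$ below is an illustrative assumption (not the $\varphi^*$ constructed in Theorem 1): programs of increasing length are enumerated until one reproduces the word $\alpha_A(\bar{x})$.

```python
from itertools import product

def alpha(A, xs):
    """The binary word alpha_A(x) = (A(x_1), ..., A(x_l)) as a 0/1 string."""
    return ''.join(str(A(x)) for x in xs)

def K_phi(A, xs, phi, max_len=16):
    """K_phi(A | x): length of the shortest program p with
    phi(p, x) = alpha_A(x), searched by brute force up to max_len bits."""
    target = alpha(A, xs)
    for n in range(1, max_len + 1):
        for bits in product('01', repeat=n):
            if phi(''.join(bits), xs) == target:
                return n
    return None  # no program of length <= max_len reproduces alpha

# An illustrative phi: repeat the program p cyclically over the sample.
def phi(p, xs):
    return ''.join(p[i % len(p)] for i in range(len(xs)))

A = lambda x: x % 2                 # a toy classifier
print(K_phi(A, [4, 7, 2, 5], phi))  # alpha = "0101", shortest p = "01", so 2
```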

Theorem 1. Let the family $S$ of partial recursive functions have finite $VCD(S) = h$ and Kolmogorov complexity $K_l(S)$. Then $VCD(S) \le K_\varphi(S \mid X^l)$ for any $\varphi \in F$ and any $l \ge h$.

Proof. The complexity of the family $S$ is defined by the expression

$$K_l(S) = \min_{\varphi \in F}\; \max_{A \in S,\; \bar{x} \in X^l}\; \min\{\, l(p) : \varphi(p, \bar{x}) = \alpha_A(\bar{x}) \,\},$$

where a binary word $\alpha_A(\bar{x})$ fixes the variant of the partition of the sample $\bar{x}$ into two subsets. All possible variants of such partitions are defined by the functions of the family $S$. For the function $A$ the binary word is defined by the expression $\alpha_A(\bar{x}) = (A(x_1), \dots, A(x_l))$; moreover, if the functions $A$ and $B$ from $S$ are equivalent, the binary words $\alpha_A(\bar{x})$ and $\alpha_B(\bar{x})$ are the same. If the partial recursive function $\varphi$ is fixed, the equality $\varphi(p_A, \bar{x}) = \alpha_A(\bar{x})$ must be fulfilled for any $A \in S$ on any sample $\bar{x} \in X^l$ according to Definition 1. Therefore, the argument $p$ must admit no fewer than $m^S(l)$ values, where $m^S(l)$ is the growth function of the family $S$. Recall that $m^S(l)$ is the maximum number of various partitions of the sample $\bar{x}$; therefore it defines the maximum possible number of various binary words $\alpha_A(\bar{x})$ of the length $l$ over all samples from $X^l$. And because $\varphi$ is a function, the inequality

$$K_\varphi(S \mid X^l) \ge \lceil \log_2 m^S(l) \rceil$$

takes place. Furthermore, the equality

$$K_l(S) = \lceil \log_2 m^S(l) \rceil \qquad (1)$$

is true. Really, it is sufficient to point out a function $\varphi^*$ such that $K_{\varphi^*}(S \mid X^l) = \lceil \log_2 m^S(l) \rceil$. This function can be defined by the following Table 1, consisting of $\mathrm{card}(X^l) \cdot m^S(l)$ cells.

TABLE I
DETERMINATION OF THE FUNCTION $\varphi^*$

The code (number)    The code (number) of the sample $\bar{x}$
of the program $p$     0              1              …
0                    $\alpha_{00}$   $\alpha_{01}$   …
1                    $\alpha_{10}$   $\alpha_{11}$   …
…                    …              …              …

The values $\alpha_{ij}$ contained in the table are binary words of the length $l$, which are treated as binary natural numbers. We mean the natural numbers extended with zero. Just as the values $\alpha_{ij}$, the samples $\bar{x}$ and the codes $p$ are interpreted as natural numbers. So, the function $\varphi^*$ can be defined on the finite set of values of arguments presented in Table 1. On any other admissible values of arguments, which are not contained in this table, the function $\varphi^*$ can be defined as zero. We remind: natural functions of natural arguments which have nonzero values only on a finite subset of their domain of definition are recursive.

Under the conditions $VCD(S) = h < \infty$ and $l \ge h$ the following expressions take place:

$$m^S(l) \ge 2^h,$$

$$\lceil \log_2 m^S(l) \rceil \ge h.$$

And finally, taking into account the equality (1), we get

$$VCD(S) \le K_l(S) \le K_\varphi(S \mid X^l) \;\; \text{for any } \varphi \in F. \qquad \blacksquare$$

Corollary 1. The Kolmogorov complexity of the family of algorithms $S$ is equal to the least whole number which is greater than or equal to the logarithm of the $l$-th shatter coefficient of this family: $K_l(S) = \lceil \log_2 m^S(l) \rceil$.

Corollary 2. $m^S(l) \le 2^{K_l(S)}$.
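Corollary 1 can be checked numerically for small families. A brute-force sketch (feasible only for tiny $X$, $S$, and $l$; the threshold family is an illustrative choice):

```python
import math
from itertools import product

def shatter_coefficient(S, X, l):
    """m^S(l): the maximum, over samples from X^l, of the number of
    distinct labelings (A(x_1), ..., A(x_l)) realized by the family S."""
    best = 0
    for xs in product(X, repeat=l):
        labelings = {tuple(A(x) for x in xs) for A in S}
        best = max(best, len(labelings))
    return best

# Toy family: threshold rules A_t(x) = [x >= t] on X = {0, ..., 7}.
S = [lambda x, t=t: int(x >= t) for t in range(9)]
m = shatter_coefficient(S, range(8), l=4)
print(m, math.ceil(math.log2(m)))  # m^S(4) = 5, so K_4(S) = 3 by Corollary 1
```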

III. THE METHOD OF PROGRAMMING OF ESTIMATIONS OF VCD AND SHATTER COEFFICIENTS

The complexity $K_l(S)$ of the class of algorithms $S$ is defined above as the minimum length of the binary word (program) $p$ which can be used to define the word $\alpha_A(\bar{x}) = (A(x_1), \dots, A(x_l))$ by means of the corresponding partial recursive function (external algorithm) $\varphi$ in the most unfavorable case over the set of samples $\bar{x} \in X^l$ and algorithms $A \in S$. It is evident that $K_l(S) \le K_\varphi(S \mid X^l)$ for any function $\varphi \in F$. Therefore, for the upper estimation of $K_l(S)$, any Turing machine can be used alternatively as the algorithm $\varphi$, if this machine calculates $\alpha_A(\bar{x})$. An appropriate program in any programming language such that $\varphi(p, \bar{x}) = \alpha_A(\bar{x})$ for the input $(p, \bar{x})$ can be used as well as a Turing machine. So, if the word $p$ and an appropriate way of calculating $\alpha_A(\bar{x})$ are defined, then the VCD can be estimated: $K_l(S) \le \max_{A \in S} l(p_A)$ and $VCD(S) \le K_l(S)$. The novel, so-called pVCD, method of programming of the estimation of VCD is based on the inequality $VCD(S) \le l(p)$, where the word $p$ is defined by the expressions $\varphi(p_A, \bar{x}) = \alpha_A(\bar{x})$ and $l(p) = \max_{A \in S} l(p_A)$. Taking into account the equality $K_l(S) = \lceil \log_2 m^S(l) \rceil$ (Corollary 1), we have $K_l(S) \le l(p)$ and $VCD(S) \le l(p)$ for any $A \in S$. The shatter coefficient can be estimated by the inequality $m^S(l) \le 2^{l(p)}$. The following very important detail must be underlined. As we noted above, we consider binary strings as natural numbers; therefore the algorithm $\varphi$ transforms the pair $(p, \bar{x})$ of natural numbers into the natural number $\alpha$. When $\alpha$ is found as a number, this number must be decoded into the string of the length $l$. To present the number $\alpha$ as a binary string we need information about the value of $l$, so $\lceil \log_2 l \rceil$ binary digits must be added into the word $p$ which defines any algorithm $A \in S$. We denote this addition $\Delta l = \lceil \log_2 l \rceil$ and the extended length $l^+(p) = l(p) + \Delta l$. To realize the pVCD method, the following steps must be done:

1º Analysis of the family $S$; definition of as restricted a set of parameters and/or properties of this family as possible in order to form the structure of the word $p$ which completely defines any algorithm $A \in S$; pointing out the algorithm $\varphi$ (a Turing machine, a partial recursive function, a program for any computer) such that $\varphi(p, \bar{x}) = \alpha_A(\bar{x})$.

2º Definition of the length of the word $p$, $l(p)$, for the upper estimation of $VCD(S)$, or of $l^+(p) = l(p) + \lceil \log_2 l \rceil$ as the upper estimation of $K_l(S)$.

The pVCD method suggests designing a compressed description $p$ for any element of the family $S$ and the algorithm $\varphi$ which processes the input $(p, \bar{x})$. In particular, evidence of the existence of such an algorithm is sufficient; but generally, the art of programming and of data organization is needed to devise the structure of the word $p$ and the algorithm $\varphi$. If we use a computer with register capacity $r$, and the algorithms from the family $S$ use this register capacity to present any parameter of the algorithm, a more detailed estimation can be obtained.
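A minimal sketch of these two steps on a toy family (the threshold encoding and all names here are illustrative assumptions, not constructions from the paper): every rule is completely defined by one $r$-bit threshold, so $l(p) = r$ and the pVCD method immediately gives $VCD(S) \le r$.

```python
R = 8  # register capacity r: each threshold is presented in 8 binary digits

def encode(threshold):
    """Step 1: the word p that completely defines the rule A_threshold."""
    return format(threshold, f'0{R}b')

def phi(p, xs):
    """Step 1: the external algorithm phi with phi(p, x) = alpha_A(x)."""
    t = int(p, 2)
    return ''.join('1' if x >= t else '0' for x in xs)

# Step 2: l(p) = R for every rule of the family, hence
# VCD(S) <= l(p) = 8, and m^S(l) <= 2**8 for every l.
print(phi(encode(100), [3, 100, 250]))  # -> "011"
```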

We illustrate the pVCD method for the family $S_{BDT}$ of binary decision trees (BDT) with not more than $V$ terminal nodes. We suppose Boolean samples and space dimension $n$. Every internal node of any tree from $S_{BDT}$ contains the number $i$ of a Boolean variable from the set $\{x^1, \dots, x^n\}$ and two pointers, the left and the right. Each pointer defines a transition to the next node according to the value of this variable. Any terminal node contains the number of a class (the result of computation), 0 or 1. A tree with $V = 5$ terminal nodes (four internal nodes) is shown in Fig. 1. Any tree defines an algorithm $A: \{0,1\}^n \to \{0,1\}$. This algorithm can be compressed into the word $p$ in the following way. The word $p$ consists of the concatenation of fragments, each containing the number $i$ of a Boolean variable and a generalized pointer $u$, as shown in Fig. 2. Finally, these fragments, as well as the whole word $p$, are presented as binary numbers. The meaning of the generalized pointer $u$ is explained in Table 2.

Fig. 1. The BDT with four internal nodes

Fig. 2. The structure of the fragment

TABLE II
THE MEANING OF THE GENERALIZED POINTER $u$

Value   Explanation
0       return_class(0)
1       return_class(1)
2       If $x^i = 0$ then return_class(0) else next_fragment
3       If $x^i = 0$ then return_class(1) else next_fragment
4       If $x^i = 1$ then return_class(0) else next_fragment
5       If $x^i = 1$ then return_class(1) else next_fragment
6       If $x^i = 1$ then goto_fragment(2) else next_fragment
…       …
$u$     If $x^i = 1$ then goto_fragment($u - 4$) else next_fragment

Now we can write the word $p$ which contains all the information needed to decode the tree given in Fig. 1. This word consists of four concatenated fragments corresponding to the four internal nodes. Each fragment consists of two fields, presented in decimal form for easy understanding; below, however, we suppose binary fixed fields of all fragments. According to Table 2, the word $p$ is the concatenation of these four fragments. Note that the fragments with the indexes 0 and 1 in the word $p$ never need to be pointed to; therefore the generalized pointer always points to indexes of fragments beginning from 2. The algorithm $\varphi$ which decodes the given word can be easily understood:

a) Get fragment 0.

b) Decode the number $i$ extracted from the first field of the fragment, and the value $u$ extracted from the second field.

c) Execute the program code for the value of $u$ according to Table 2. Either the result (the number of a class) will be obtained and the algorithm will stop; or a transition to the fragment pointed to by the value $u$, or a transition to the next concatenated fragment, will be completed.

We explain the procedures used: return_class($c$) returns the answer 0 if $c = 0$ or the answer 1 if $c = 1$, and then the algorithm ends; next_fragment performs a transition to the right, to the next fragment; goto_fragment($j$) performs a transition to the fragment number $j$.
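A runnable sketch of this decoder, under the fragment semantics assumed in Table 2 above (the example tree is arbitrary, not the tree of Fig. 1):

```python
def decode_and_run(fragments, x):
    """Evaluate a BDT compressed into fragments (i, u), where i is the
    number of the tested Boolean variable and u is the generalized
    pointer interpreted as in Table 2."""
    k = 0                                 # a) get fragment 0
    while True:
        i, u = fragments[k]               # b) decode the two fields
        v = x[i]                          # value of the tested variable x^i
        if u == 0:                        # return_class(0)
            return 0
        if u == 1:                        # return_class(1)
            return 1
        if u <= 5:                        # conditional class returns
            test, cls = divmod(u - 2, 2)  # u=2,3 test v=0; u=4,5 test v=1
            if v == test:
                return cls
            k += 1                        # next_fragment
        else:                             # u >= 6: conditional goto
            if v == 1:
                k = u - 4                 # goto_fragment(u - 4)
            else:
                k += 1                    # next_fragment

# Example word: fragment 0 branches on x^0, fragments 1 and 2 finish.
# The encoded tree computes x^0 OR x^1.
fragments = [(0, 6), (1, 2), (1, 1)]
print(decode_and_run(fragments, [0, 0]))  # -> 0
print(decode_and_run(fragments, [1, 0]))  # -> 1
print(decode_and_run(fragments, [0, 1]))  # -> 1
```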

To encode any tree $A \in S_{BDT}$, at most $V - 1$ fragments are needed, because $V - 1$ is the number of internal nodes if the number of terminal nodes is $V$. Thus, the generalized pointer has to possess 6 special values and $V - 3$ values to point to the fragments indexed as $2, \dots, V - 2$. Therefore $V + 3$ values for the generalized pointer are needed, and to encode them, $\lceil \log_2 (V + 3) \rceil$ binary digits are needed. Finally, $\lceil \log_2 n \rceil + \lceil \log_2 (V + 3) \rceil$ binary digits are needed to encode one fragment, and the length of the binary word $p$ is obtained:

$$l(p) = (V - 1)\big( \lceil \log_2 n \rceil + \lceil \log_2 (V + 3) \rceil \big).$$

Note that $\lceil \log_2 l \rceil$ binary digits are added into the word $p$ to define the length of the binary string which is the output of the algorithm $\varphi$. Since a tree never depends on the sample length $l$, this addition must be excluded from $l(p)$ when the VCD is estimated by the pVCD method. Taking into account the inequality $VCD(S) \le l(p)$, we get the following estimation for the family of binary decision trees with at most $V$ terminal nodes when at most $n$ binary variables are used:

$$VCD(S_{BDT}) \le (V - 1)\big( \lceil \log_2 n \rceil + \lceil \log_2 (V + 3) \rceil \big).$$
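Under the fragment counting above, the bound is easy to tabulate:

```python
import math

def bdt_vcd_bound(V, n):
    """pVCD bound for BDTs with at most V terminal nodes on n variables:
    (V - 1) * (ceil(log2 n) + ceil(log2 (V + 3)))."""
    return (V - 1) * (math.ceil(math.log2(n)) + math.ceil(math.log2(V + 3)))

print(bdt_vcd_bound(V=5, n=16))  # 4 * (4 + 3) = 28
```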

For the family $S_{LBDT}$ of binary decision trees with at most $V$ terminal nodes, with a linear predicate in any internal node, with at most $n$ variables, and with coefficients and variable values presented in $r$ digits per word, we easily get the estimation

$$VCD(S_{LBDT}) \le (V - 1)\big( r(n + 1) + \lceil \log_2 (V + 3) \rceil \big).$$

Note that the family $S_{LBDT}$ is a very extensive class of algorithms; therefore the estimation of $VCD(S_{LBDT})$ yields large values when all variables are used to define a linear separating rule in any internal node. For neural networks with $m$ nodes in a single hidden layer, $n$ inputs, and parameters presented in $r$ digits per word, the analogous count of the describing word gives the estimation

$$VCD \le r\,\big(m(n + 2) + 1\big).$$

IV. VERIFICATION OF THE SIGNIFICANCE LEVEL OF REGULARITIES DISCOVERED IN EMPIRICAL DATA IN THE TERMS OF THE KOLMOGOROV APPROACH

Definition 2. Let $\bar{t}$ be a fixed sample given from $T_l$, and $S$ the family of algorithms used for training. The solution $A^*$ of the functional system (2), if it exists, is called a correct tuning on the sample $\bar{t}$. The solution $A^*$ of the functional system (3), if it exists, is called a tuning on the fixed $l - \varepsilon$ elements of the sample $\bar{t}$.

$$A(x_j) = \tilde{A}(x_j), \quad j = 1, \dots, l; \quad A \in S. \qquad (2)$$

$$A(x_j) = \tilde{A}(x_j), \quad j \in J \subset \{1, \dots, l\}, \quad \mathrm{card}(J) = l - \varepsilon; \quad A \in S. \qquad (3)$$

Evidently, a tuning on the fixed $l - \varepsilon$ elements of the sample $\bar{t}$ is a correct tuning on some part of the sample $\bar{t}$. In machine learning problems, as usual, the sample is randomly and independently derived from the general population $T_l$; below we use this model of derivation. In a randomly derived pair $(\bar{x}, \tilde{\alpha})$, the Boolean vector $\tilde{\alpha} = (\tilde{A}(x_1), \dots, \tilde{A}(x_l))$ appears with a certain probability. When a correct tuning is realized in some way and there are no errors on the given sample, the values of $A^*$ on the set $X \setminus \{x_1, \dots, x_l\}$ can be arbitrary, and the decision rule $A^*$ which is found can be erroneous, generally speaking, on any $x \in X \setminus \{x_1, \dots, x_l\}$. In other words, a direct solving of the systems (2) or (3) is absolutely not equivalent to learning from examples! To realize an empirical induction based on the sample, it is necessary to generalize the properties of this sample so as to obtain not only zero empirical error on this sample, but also as few errors as possible on all admissible objects of the set $X$. What happens when we choose a family $S$ which contains a correct tuning on the given sample, but does not contain the true (or close to the true) regularity which generates the samples derived from the general population $T_l$? We consider such an event as a random tuning on the sample.

Theorem 2. Let the probability model of derivation from the general population $T_l$ be such that the appearance of any Boolean vector $\tilde{\alpha} \in \{0,1\}^l$ in an arbitrarily derived pair $(\bar{x}, \tilde{\alpha})$ is equally probable. Then the probability $P_\varepsilon$ of a random tuning on some $l - \varepsilon$ elements of the sample satisfies the inequality

$$P_\varepsilon \le \binom{l}{\varepsilon}\, 2^{K_l(S) - l + \varepsilon},$$

where $K_l(S)$ is the Kolmogorov complexity of the family $S$, and $\varepsilon$ is the number of errors admitted on the training sample by the algorithm $A^*$ realized as the

result of training.

Proof. The family $S$ unambiguously generates the finite set $Q(\bar{x}) = \{\alpha_A(\bar{x}) : A \in S\}$ of various ways of classification for any given sample $\bar{x}$. The cardinality of the set $Q(\bar{x})$ is at most $m^S(l)$. A correct tuning on all elements of a sample can be realized if and only if the way of classification $\tilde{\alpha}$ of the sequence $\bar{x}$ into two classes is contained in the set $Q(\bar{x})$ (in other words, when a random extraction of a sample is realized, the vector $\tilde{\alpha}$ "hits" the set $Q(\bar{x})$). Any possible $\tilde{\alpha}$ which can be presented in a sample is equally probable according to the condition of the theorem. Therefore the probability of a correct tuning on a fixed part of the sample of a length of $l - \varepsilon$ is at most $m^S(l)\, 2^{-(l - \varepsilon)}$. The $l - \varepsilon$ elements from $\bar{t}$ can be chosen in $\binom{l}{\varepsilon}$ ways. Therefore we have the estimation

$$P_\varepsilon \le \binom{l}{\varepsilon}\, m^S(l)\, 2^{-(l - \varepsilon)}.$$

According to Corollary 1, $m^S(l) \le 2^{K_l(S)}$. Therefore

$$P_\varepsilon \le \binom{l}{\varepsilon}\, 2^{K_l(S) - l + \varepsilon}. \qquad \blacksquare$$

Corollary 3. The probability $P_0$ of a random correct tuning on the whole sample satisfies the inequality $P_0 \le 2^{K_l(S) - l}$.

Corollary 4. If the estimation $K_l(S) \le l(p)$ of the Kolmogorov complexity is obtained by the pVCD method, then $P_0 \le 2^{l(p) - l}$. If $l \ge l(p) + 5$, then $P_0 \le 2^{-5} < 0.04$, and the nonrandomness of the regularity found will be not less than 0.96. This is acceptable in practice. Thus we have the rule "of plus five": to obtain a reliable regularity when machine learning is used, the length of the training sample must exceed the Kolmogorov complexity of the algorithm family used by at least 5.
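The bound of Theorem 2 and the rule "of plus five" are easy to evaluate numerically; a small sketch:

```python
import math

def random_tuning_bound(K, l, eps=0):
    """Theorem 2: P_eps <= C(l, eps) * 2**(K - l + eps)."""
    return math.comb(l, eps) * 2.0 ** (K - l + eps)

# Rule "of plus five": with zero training errors and l = K + 5,
# the probability of a random tuning is at most 2**-5 ~ 0.031,
# i.e. the nonrandomness of the regularity is at least 0.96.
print(random_tuning_bound(K=20, l=25))          # 0.03125
print(random_tuning_bound(K=20, l=35, eps=2))   # bound with 2 admitted errors
```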

CONCLUSION

The novel pVCD method presented in this paper allows one to estimate both $VCD(S)$ and the Kolmogorov complexity $K_l(S)$ of a family of learning algorithms by using the technique of programming, which gives it advantages over the more complicated combinatorial approach. The possible applications of the presented results are the following: obtaining novel estimations of the VCD; reliability estimation of the algorithms which are found as a result of machine learning; and estimation of the required lengths of training samples.

REFERENCES

[1] V. N. Vapnik. Recovery of Dependencies Based on Empirical Data. Moscow: Nauka, 1979 (in Russian).
[2] A. N. Kolmogorov. Information Theory and Theory of Algorithms. Moscow: Nauka, 1987 (in Russian).
[3] P. M. B. Vitanyi, M. Li. Minimum Description Length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. on Inf. Theory, 46(2), 2000, pp. 446-464.
[4] L. Devroye, L. Györfi, G. Lugosi. A Probabilistic Theory of Pattern Recognition. NY: Springer-Verlag, 1997.
[5] V. I. Donskoy. Kolmogorov complexity of the classes of partly recursive functions with a restricted capacity. Tavrian Herald for Computer Science Theory and Mathematics, 1, 2005, pp. 25-34 (in Russian).