Quasi-random resampling
O. Teytaud*, S. Gelly*, S. Lallich**, E. Prudhomme**
* Equipe I&A-TAO, LRI, Université Paris-Sud, Inria, UMR-Cnrs 8623
** Equipe ERIC, Université Lyon 2
Email: [email protected], [email protected]
Koksma & Hlawka: the error in Monte-Carlo integration is bounded by Discrepancy × V
V = total variation (in the sense of Hardy & Krause)
(many generalizations in Hickernell, A Generalized Discrepancy and Quadrature Error Bound, 1997)
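As an aside, the inequality is easy to observe numerically. The sketch below is our own illustration, not from the paper: it compares plain Monte-Carlo with the 1-D van der Corput low-discrepancy sequence on the integral of x² over [0,1], whose exact value is 1/3.

```python
import random

def van_der_corput(i, base=2):
    """Radical inverse of i in the given base: the classic 1-D
    low-discrepancy sequence."""
    x, denom = 0.0, 1.0
    while i > 0:
        i, rem = divmod(i, base)
        denom *= base
        x += rem / denom
    return x

def integrate(points, f):
    """Equal-weight quadrature: average of f over the point set."""
    return sum(f(x) for x in points) / len(points)

f = lambda x: x * x          # exact integral on [0,1] is 1/3
n = 1024
qmc_pts = [van_der_corput(i) for i in range(1, n + 1)]
rng = random.Random(0)
mc_pts = [rng.random() for _ in range(n)]

err_qmc = abs(integrate(qmc_pts, f) - 1 / 3)
err_mc = abs(integrate(mc_pts, f) - 1 / 3)
# the low-discrepancy error is typically far smaller than the pseudo-random one
```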
Which set do you trust?
Which quasi-random numbers ?
« Halton-sequence with a simple scrambling scheme »
● fast (as fast as pseudo-random numbers);
● easy to implement;
● available freely if you don't want to implement it.
(we will not detail how this sequence is built here)
(also: Sobol sequence)
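A minimal sketch of such a sequence, assuming a simple digit-permutation scrambling (one random permutation of the nonzero digits per prime base, keeping the digit 0 fixed); the function names are ours:

```python
import random

def first_primes(d):
    """First d prime numbers, used as the Halton bases."""
    primes, k = [], 2
    while len(primes) < d:
        if all(k % p for p in primes):
            primes.append(k)
        k += 1
    return primes

def scrambled_halton(n, d, seed=0):
    """First n points of a d-dimensional Halton sequence with a simple
    digit scrambling: one random permutation of {1, ..., b-1} per base b,
    with 0 kept fixed so trailing zero digits contribute nothing."""
    rng = random.Random(seed)
    bases = first_primes(d)
    perms = []
    for b in bases:
        p = list(range(1, b))
        rng.shuffle(p)
        perms.append([0] + p)   # digit 0 stays 0
    points = []
    for i in range(1, n + 1):
        point = []
        for b, perm in zip(bases, perms):
            j, x, denom = i, 0.0, 1.0
            while j > 0:
                j, digit = divmod(j, b)
                denom *= b
                x += perm[digit] / denom
            point.append(x)
        points.append(point)
    return points

pts = scrambled_halton(100, 3)
```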
What else besides Monte-Carlo integration?
Thanks to various forms of quasi-random sequences:
● Numerical integration [thousands of papers; Niederreiter 92]
● Learning [Cervellera et al., IEEE TNN 2004; Mary, PhD 2005]
● Optimization [Teytaud et al., EA'2005]
● Modeling of random processes [Gröwe-Kuska et al., BPTP'03, Levy's method]
● Path planning [Tuffin]
... and how to do it in strange spaces?
(1) why resampling is Monte-Carlo integration
(2) quasi-random numbers
(3) quasi-random numbers in strange spaces
(4) applying quasi-random numbers in resampling
(5) when does it work and when doesn't it work ?
Have fun with QR in strange spaces
(3) quasi-random numbers in strange spaces
We have seen that resampling is Monte-Carlo integration, and how Monte-Carlo is replaced by Quasi-Random Monte-Carlo.
But resampling is random in a non-standard space.
We will see how to do Quasi-Random Monte-Carlo in non-standard spaces.
Quasi-random numbers in strange spaces
We have seen hypercubes:
... but we need something else!
Sample of points ---> QR sample of points
Sample of samples ---> QR sample of samples
Quasi-random points in strange spaces
Fortunately, QR-points also exist in various other spaces.
Why not in something isotropic?
How to do it on the sphere? Or for Gaussian distributions?
For the Gaussian: easy!
Generate x in [0,1]^d by quasi-random
Build y such that P(N < y(i)) = x(i), i.e. y(i) = Φ^{-1}(x(i)) with Φ the standard normal CDF
It works because the distribution is the product of the marginal distributions of the y(i)
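In Python, the coordinate-wise inverse CDF is available in the standard library; a minimal sketch (the function name is ours):

```python
from statistics import NormalDist

def gaussian_from_qr(x):
    """Map one quasi-random point x in [0,1]^d to a d-dimensional
    standard Gaussian point, coordinate by coordinate:
    y(i) = Phi^{-1}(x(i)), with Phi the standard normal CDF."""
    inv = NormalDist().inv_cdf
    return [inv(xi) for xi in x]

# x = 0.5 maps to the median 0; symmetric quantiles map to +/- the same value
y = gaussian_from_qr([0.5, 0.975, 0.025])
```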
What about the general case?
Ok!
- generate x in [0,1]^d
- define y(i) such that P(t < y(i) | y(1), y(2), ..., y(i-1)) = x(i) (conditional inverse CDF)
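A minimal sketch of this sequential conditional inversion, on a hypothetical example distribution of our own choosing: the uniform distribution on the triangle {0 ≤ y(2) ≤ y(1) ≤ 1}.

```python
import math

def triangle_from_qr(x1, x2):
    """Sequential conditional inversion for the uniform distribution on
    the triangle {0 <= y2 <= y1 <= 1}:
      - the marginal CDF of y1 is P(Y1 < t) = t^2, so y1 = sqrt(x1);
      - given y1, y2 is uniform on [0, y1], so y2 = x2 * y1."""
    y1 = math.sqrt(x1)
    y2 = x2 * y1
    return y1, y2

y1, y2 = triangle_from_qr(0.25, 0.5)   # (0.5, 0.25)
```

Feeding quasi-random points (x1, x2) into such a map yields a low-discrepancy sample of the target distribution.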
However, this is what we will do
● We do not have anything better than this general method for the strange distributions we are interested in
● At least we can prove the O(1/n) property (see the paper)
● Perhaps there is something much better
● Perhaps there is something much simpler
The QR-numbers in resampling
(4) applying quasi-random numbers in resampling
We have seen that resampling is Monte-Carlo integration, and that we are able to generate quasi-random points for any distribution on continuous domains;
==> it should work
==> let's see in detail how to move the problem to the continuous domain
QR-numbers in resampling
A very particular distribution for QR-points: bootstrap samples. How to move the problem to continuous spaces?
y(i) = x(r(i)) where r(i) is uniformly distributed on {1, ..., n} ==> this is discrete; many solutions exist.
We know: rectangular uniform distribution ---> any continuous distribution
We need: continuous distribution ---> our discrete distribution
What are bootstrap samples?
Our technique works for various forms of resampling:
- subsamples without replacement (random-CV, subagging)
- subsamples with replacement (bagging, bootstrap)
- random partitioning (k-CV).
W.l.o.g., we present here the sampling of n elements from a sample of size n with replacement (= bootstrap resampling).
(useful in e.g. bagging, bias/variance estimation, ...)
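These three forms of resampling, in their plain pseudo-random version, can be sketched as follows (our illustration):

```python
import random

rng = random.Random(0)
data = list(range(10))
n = len(data)

# subsample without replacement (random-CV, subagging)
sub = rng.sample(data, n // 2)

# subsample with replacement (bagging, bootstrap)
boot = [rng.choice(data) for _ in range(n)]

# random partitioning (k-CV): shuffle, then split into k folds
k = 5
shuffled = data[:]
rng.shuffle(shuffled)
folds = [shuffled[i::k] for i in range(k)]
```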
A naive solution
y(i) = x(r(i))
r(1), ..., r(n) = ceil(n × qr) where qr ∈ [0,1]^n
QR in dimension n, with n the number of examples.
Example (n = 5): qr = (0.1, 0.9, 0.84, 0.9, 0.7) ==> r = (1, 5, 5, 5, 4), i.e. counts (1, 0, 0, 1, 3).
==> all permutations of (0.1, 0.9, 0.84, 0.9, 0.7) lead to the same result!
...which does not work.
In practice it does not work better than random.
Two very distinct QR-points can lead to very similar resamples (permutations of a point lead to the same sample).
We have to remove this symmetry.
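The naive mapping and its permutation symmetry can be checked in a few lines (the function name is ours):

```python
import math

def naive_bootstrap_indices(qr_point, n):
    """Naive mapping of one quasi-random point in [0,1]^n to bootstrap
    indices: r(i) = ceil(n * qr(i)), i.e. y(i) = x(r(i)); indices in 1..n."""
    return [max(1, math.ceil(n * q)) for q in qr_point]

r = naive_bootstrap_indices([0.1, 0.9, 0.84, 0.9, 0.7], 5)   # [1, 5, 5, 5, 4]
counts = [r.count(k) for k in range(1, 6)]                    # [1, 0, 0, 1, 3]

# the flaw: permuting the coordinates of the QR-point yields the
# same multiset of indices, hence exactly the same bootstrap sample
r2 = naive_bootstrap_indices([0.9, 0.1, 0.84, 0.7, 0.9], 5)
```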
A less naive solution
z(i) = number of times x(i) appears in the bootstrap sample
Then, we randomly draw the elements in each cluster.
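One way to generate the count vector z directly from a quasi-random point is the sequential conditional inversion of section (3) applied to the multinomial counts: z(1) ~ Binomial(n, 1/n), and z(i), given the previous counts, is binomial over the remaining draws. This is only a sketch of that idea under our own helper names; the paper's actual method additionally reduces dimension via clustering, which we omit here.

```python
def binom_inv_cdf(x, m, p):
    """Smallest k with P(Binomial(m, p) <= k) >= x (inverse CDF by scan)."""
    cdf, pmf = 0.0, (1 - p) ** m
    for k in range(m + 1):
        cdf += pmf
        if cdf >= x or k == m:
            return k
        # recurrence pmf(k+1) = pmf(k) * (m-k)/(k+1) * p/(1-p)
        pmf *= (m - k) / (k + 1) * p / (1 - p)
    return m

def counts_from_qr(qr_point, n):
    """Map a quasi-random point in [0,1]^(n-1) to a bootstrap count
    vector z, where z(i) = multiplicity of x(i), via sequential
    conditional inversion: z(1) ~ Bin(n, 1/n), and
    z(i) | z(1), ..., z(i-1) ~ Bin(remaining, 1/(n-i+1))."""
    z, remaining = [], n
    for i in range(1, n):
        k = binom_inv_cdf(qr_point[i - 1], remaining, 1.0 / (n - i + 1))
        z.append(k)
        remaining -= k
    z.append(remaining)   # the last count is forced
    return z

# the central QR-point maps to the most balanced bootstrap sample
z = counts_from_qr([0.5, 0.5, 0.5, 0.5], 5)
```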
Let's conclude
(1) why resampling is Monte-Carlo integration
(2) quasi-random numbers
(3) quasi-random numbers in strange spaces
(4) applying quasi-random numbers in resampling
(5) when does it work and when doesn't it work ?
Experiments
In our (artificial) experiments:
● QR-randomCV is better than randomCV
● QR-bagging is better than bagging
● QR-subagging is better than subagging
● QR-Bsfd is better than Bsfd (a bootstrap)
But QR-kCV is not better than kCV: kCV already has some derandomization, as each point appears the same number of times in learning.
A typical example
You want to learn a relation x --> y on a huge ordered dataset.
The dataset is too large for your favorite learner.
A traditional solution is subagging: average 100 learners, each trained on a random subset (1/20) of your dataset.
We propose: use QR-sampling to average only 40 learners.
Or do you have a better solution for choosing 40 subsets of 1/20 ?
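For reference, plain (pseudo-random) subagging can be sketched as below; this is not the QR variant, and `train` and `predict` are hypothetical user-supplied callables:

```python
import random

def subag_predict(dataset, train, predict, n_learners=100, frac=0.05, seed=0):
    """Plain subagging sketch: train n_learners models on random subsets
    of size frac * len(dataset) (without replacement), then average the
    models' predictions."""
    rng = random.Random(seed)
    k = max(1, int(frac * len(dataset)))
    models = [train(rng.sample(dataset, k)) for _ in range(n_learners)]
    return lambda x: sum(predict(m, x) for m in models) / n_learners

# toy demo: each "model" is just the mean target of its subset,
# so the averaged prediction approaches the global mean target
dataset = [(float(i), 2.0 * i) for i in range(100)]
train = lambda sub: sum(y for _, y in sub) / len(sub)
predict = lambda m, x: m
f = subag_predict(dataset, train, predict, n_learners=20)
```

The QR version would replace the pseudo-random subset choice by a quasi-random one, as described above.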
Conclusions
Therefore:
● perhaps simpler derandomizations are enough?
● perhaps in cases like CV, in which « symmetrizing » (picking each example the same number of times) is easy, this is useless?
For bagging, subagging and bootstrap, simplifying the approach is not so simple
==> we now use QR-bagging, QR-subagging and QR-bootstrap instead of bagging, subagging and bootstrap.
Further work
Real-world experiments (in progress, for DP applications)
Other dimension reductions (this one involves clustering)
Simplified derandomization methods (jittering, antithetic variables, ...)
Random clustering for dimension reduction? (we have not tested it yet, sorry...)