Testing Stochastic Processes Through Reinforcement Learning


1

Testing Stochastic Processes Through Reinforcement Learning

Josée Desharnais
François Laviolette
Sami Zhioua

NIPS Workshop, December 9th, 2006

2

Outline

Program Verification Problem

The Approach for trace-equivalence

Other equivalences

Application on MDPs

Conclusion

3

Stochastic Program Verification

Specification (LMP): an MDP without rewards.

Implementation: the system under test.

[Diagram: an example LMP with states s0-s6 and probabilistic transitions such as a[0.5], a[0.3], b[0.9], and c.]

How far is the Implementation from the Specification?

(Distance or divergence)

The Specification model is available.

The Implementation is available only for interaction (no model).
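To make this asymmetry concrete, here is a minimal sketch (not the authors' code; state names and probabilities are made up): the specification is an explicit model, while the implementation can only be exercised by pushing action buttons.

```python
import random

# Specification LMP: state -> action -> list of (probability, next_state).
# A sub-probability (sum < 1) means the action can also be refused.
SPEC = {
    "s0": {"a": [(0.5, "s1"), (0.3, "s2")]},   # 'a' refused with probability 0.2
    "s1": {"b": [(0.9, "s3")]},
    "s2": {"b": [(0.9, "s4")], "c": [(1.0, "s5")]},
    "s3": {}, "s4": {}, "s5": {},
}

class BlackBoxImplementation:
    """The implementation: only interaction (pushing a button) is possible."""
    def __init__(self, hidden_model, start="s0"):
        self._model = hidden_model    # invisible to the tester
        self._state = start

    def push(self, action):
        """Try to execute `action`; return True iff the button goes down."""
        r, acc = random.random(), 0.0
        for prob, nxt in self._model.get(self._state, {}).get(action, []):
            acc += prob
            if r < acc:
                self._state = nxt
                return True
        return False                  # action refused: no transition taken
```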

4

1. Non-deterministic trace equivalence

[Diagram: two non-deterministic processes P and Q over actions a, b, c.]

Trace Equivalence

Two systems are trace equivalent iff they accept the same set of traces

T(P) = {a, aa, aac, ac, b, ba, bab, c, cb, cc}
T(Q) = {a, ab, ac, abc, abca, ba, bab, c, ca}

2. Probabilistic trace equivalence

Two systems are trace equivalent iff they accept the same set of traces and with the same probabilities

[Diagram: two probabilistic processes P and Q, with transitions such as a[2/3], a[1/3], b[2/3], a[1/4], a[3/4], b[1/2], and c[1/2].]

Trace probabilities:

P: a 7/12, aa 5/12, aac 1/6, bc 2/3
Q: a 1, aa 1/2, aac 0, bc 0
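The probability of a trace is obtained by summing, over all paths labelled by the trace, the product of the transition probabilities. A small sketch of that computation (the toy process below is illustrative, not the P or Q of the slide):

```python
# Toy LMP: state -> action -> list of (probability, next_state).
P = {
    "p0": {"a": [(2/3, "p1"), (1/3, "p2")]},
    "p1": {"a": [(1/4, "p3")], "b": [(1/2, "p4")]},
    "p2": {"a": [(3/4, "p5")]},
    "p3": {"c": [(1.0, "p6")]},
    "p4": {}, "p5": {}, "p6": {},
}

def trace_prob(model, state, trace):
    """Probability that `trace` (a string of actions) is accepted from `state`."""
    if not trace:
        return 1.0
    head, rest = trace[0], trace[1:]
    return sum(p * trace_prob(model, nxt, rest)
               for p, nxt in model.get(state, {}).get(head, []))

# Two LMPs are probabilistically trace equivalent iff trace_prob agrees on
# every trace.  For this toy model, trace "aa" has probability
# 2/3 * 1/4 + 1/3 * 3/4 = 5/12.
print(trace_prob(P, "p0", "aa"))
```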

5

Testing (Trace Equivalence)

The system is a black box.

When a button is pushed (action execution), either the button goes down (a transition is taken) or the button does not go down (no transition).

Grammar (trace equivalence): t ::= ε | a.t

Observations: when a test t is executed, several observations are possible; they form the set O_t.

[Diagram: a black-box testing machine with buttons a, b, ..., z, attached to a process with transitions such as a[0.2], a[0.5], and b[0.7].]

Example: for t = a.b, the possible observations are O_t = {a refused; a accepted then b refused; a accepted then b accepted}, occurring with probabilities 0.3, 0.56, and 0.14.
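Executing such a test amounts to pushing the buttons of the trace one by one and recording how far the test got; repeating it estimates the probability of each observation. A hedged sketch (the black box below is a made-up stand-in; only its push() interface matters):

```python
import random
from collections import Counter

class BlackBox:
    """A stand-in for the system under test; the tester only sees push()."""
    def __init__(self):
        self.model = {"s0": {"a": [(0.7, "s1")]}, "s1": {"b": [(0.8, "s2")]}}
        self.state = "s0"
    def push(self, action):
        r, acc = random.random(), 0.0
        for p, nxt in self.model.get(self.state, {}).get(action, []):
            acc += p
            if r < acc:
                self.state = nxt
                return True           # the button goes down
        return False                  # the button does not go down

def run_test(box, trace):
    """Return the observation: the prefix of the trace that was accepted."""
    accepted = []
    for action in trace:
        if not box.push(action):
            break
        accepted.append(action)
    return tuple(accepted)

# Estimate the probability of each observation in O_t for t = a.b.
counts = Counter(run_test(BlackBox(), "ab") for _ in range(10000))
print(counts)
```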

6

Outline

Program Verification Problem

The Approach for trace-equivalence

Other equivalences

Application on MDPs

Conclusion

7

Why Reinforcement Learning?

[Diagram: the specification LMP (states s0-s8, with transitions such as a[0.2], a[0.5], b[0.7], a[0.3], b[0.9], a[0.7]) and the MDP induced from it over the same states.]

Reinforcement Learning is particularly efficient in the absence of the full model.


Reinforcement Learning can deal with bigger systems.

Analogy:

LMP ↔ MDP
Trace ↔ Policy
Divergence ↔ Optimal Value (V*)

8

A Stochastic Game towards RL

[Diagram: the same test is run on the Implementation, on the Specification, and on a clone of the Specification; each run yields a sequence of successes (S) and failures (F), which is mapped to rewards. The three processes are LMPs over states s0-s10 with transitions such as a[0.2], a[0.5], a[0.3], b[0.7], b[0.3], b[0.9], c[0.4], c[0.2], c[0.8], c[0.7].]

Reward: +1 when the observation on Impl differs from the observation on Spec (Impl ≠ Spec).

Reward: -1 when the observation on Spec differs from the observation on the Clone (Spec ≠ Clone).
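A hedged sketch of one episode of this game (the models and state names below are made up, and the slides' construction also involves a Dead state once an action is refused, which is omitted here): the same action is pushed on all three systems, and the disagreements determine the reward.

```python
import random

def make_process(model, start="s0"):
    """Wrap an LMP as a black box: push(action) -> True (goes down) / False."""
    state = {"s": start}
    def push(action):
        r, acc = random.random(), 0.0
        for p, nxt in model.get(state["s"], {}).get(action, []):
            acc += p
            if r < acc:
                state["s"] = nxt
                return True
        return False
    return push

# Illustrative models only (not the ones on the slide).
SPEC = {"s0": {"a": [(0.2, "s1"), (0.3, "s2")]},
        "s1": {"b": [(0.7, "s3")]}, "s2": {"b": [(0.3, "s4")]},
        "s3": {}, "s4": {}}
IMPL = {"s0": {"a": [(0.2, "s1"), (0.5, "s2")]},
        "s1": {"b": [(0.7, "s3")]}, "s2": {"b": [(0.3, "s4")]},
        "s3": {}, "s4": {}}

def play_episode(actions):
    impl, spec, clone = make_process(IMPL), make_process(SPEC), make_process(SPEC)
    total = 0
    for a in actions:
        o_impl, o_spec, o_clone = impl(a), spec(a), clone(a)
        total += (o_impl != o_spec) - (o_spec != o_clone)   # +1 / 0 / -1
    return total

# The average return over many episodes estimates how distinguishable the
# Implementation is from the Specification under this fixed test.
print(sum(play_episode("ab") for _ in range(10000)) / 10000)
```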

9

MDP Definition

The MDP is induced from the Specification LMP: its states, actions, and next-state probability distributions come from the Specification.

[Diagram: the Implementation and Specification LMPs (states s0-s10, with transitions such as a[0.2], a[0.5], b[0.7], b[0.3], c[0.8], c[0.7], b[0.9]) and the induced MDP, which includes a Dead state.]

10

Divergence Computation

[Illustration: observed sequences of successes (S) and failures (F) are mapped to rewards +1, 0, and -1.]

V*(s0): 0 means the systems are equivalent, 1 means they are different.

[Diagram: the Implementation and Specification LMPs and the induced MDP with its Dead state, as on the previous slide.]
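If the induced MDP were fully known, the divergence V*(s0) could be computed directly, for instance by value iteration; in the talk's setting the Implementation has no model, which is why RL is used instead. A generic value-iteration sketch on a made-up toy MDP (everything here is illustrative):

```python
GAMMA = 0.8

# Toy MDP: state -> {action: (expected_reward, [(prob, next_state), ...])}
MDP = {
    "s0": {"a": (0.3, [(1.0, "s1")]), "b": (0.0, [(1.0, "dead")])},
    "s1": {"b": (0.1, [(1.0, "dead")])},
    "dead": {},                                   # absorbing, no actions
}

def value_iteration(mdp, gamma=GAMMA, iters=1000):
    V = {s: 0.0 for s in mdp}
    for _ in range(iters):
        V = {s: max((r + gamma * sum(p * V[t] for p, t in trans)
                     for r, trans in acts.values()), default=0.0)
             for s, acts in mdp.items()}
    return V

print(value_iteration(MDP)["s0"])                 # the divergence would be V*(s0)
```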

11

Symmetry Problem

Implementation vs. Specification

[Illustration: sequences of successes (S) and failures (F) observed on the two systems, with rewards +1 and -1.]

Create two variants for each action a: a success variant and a failure variant.

[Diagram: a process with transition s0 --a[1]--> s1, and the Spec and its Clone with transition s0 --a[0.5]--> s1.]

Select an action and make a prediction (success or failure), then execute the action. If the prediction matches the observation (pred = obs), compute and give the reward; if pred ≠ obs, give reward 0.

In the example, the relevant probability is 0 · 0.5 · 0.5 + 1 · 0.5 · 0.5 = 0.25 in both cases.

12

The Divergence (with the symmetry problem fixed)

Theorem. Let "Spec" and "Impl" be two LMPs, and M their induced MDP.

V*(s0) ≥ 0, and

V*(s0) = 0 iff "Spec" and "Impl" are trace-equivalent.

13

Implementation and PAC Guarantee

There exists a PAC guarantee for the Q-Learning algorithm, but ...

Fiechter's algorithm has a simpler PAC guarantee.

In addition, a lower bound can be obtained from the Hoeffding inequality.

Implementation details:

Discount factor γ = 0.8
Action selection: softmax (temperature decreasing from 0.8 to 0.01)
RL algorithm: Q-Learning, with the learning rate decreasing as 1/x

PAC guarantee:
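A hedged sketch of such a learning loop, assuming the parameters listed above (softmax exploration annealed from 0.8 to 0.01, a 1/x learning-rate decay, a 0.8 discount); the environment here is a trivial made-up stub, not the game of the earlier slides.

```python
import math, random
from collections import defaultdict

ACTIONS = ["a", "b"]
GAMMA = 0.8

def softmax_choice(q_values, temperature):
    m = max(q_values)
    weights = [math.exp((q - m) / temperature) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]

def q_learning(env_step, env_reset, episodes=5000):
    Q = defaultdict(lambda: [0.0] * len(ACTIONS))
    visits = defaultdict(int)
    for ep in range(episodes):
        temperature = max(0.01, 0.8 * (1 - ep / episodes))   # 0.8 -> 0.01
        state, done = env_reset(), False
        while not done:
            a = softmax_choice(Q[state], temperature)
            next_state, reward, done = env_step(state, a)
            visits[(state, a)] += 1
            alpha = 1.0 / visits[(state, a)]                  # 1/x decay
            target = reward + (0.0 if done else GAMMA * max(Q[next_state]))
            Q[state][a] += alpha * (target - Q[state][a])
            state = next_state
    return Q

# Stub environment: a single decision then termination, with noisy rewards.
def reset(): return "s0"
def step(state, a):
    return "end", random.choice([0, 1]) if ACTIONS[a] == "a" else 0, True

Q = q_learning(step, reset)
print(max(Q["s0"]))   # estimate of V*(s0)
```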

14

Outline

Program Verification Problem

The Approach for trace-equivalence

Other equivalences

Application on MDPs

Conclusion

15

Testing (Bisimulation)

The system is a black box.

Grammar (bisimulation): t ::= ε | a.t | (t1, ..., tn)

The new construct (t1, ..., tn) is replication: several tests are run independently from the current state.

[Diagram: a black-box testing machine with buttons a, b, ..., z, attached to a process with transitions such as a[0.2], a[0.5], and b[0.7].]

Example: for t = a.(b, b), the possible observations and their probabilities P_{t,s0} are: a refused: 0.3; a accepted, both b's refused: 0.518; a accepted, exactly one b accepted: 0.042 + 0.042; a accepted, both b's accepted: 0.098.
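A small sketch (illustrative model and test encoding, not from the slides) of evaluating such a test on a known LMP, where success of a replicated test here means every copy succeeds:

```python
# Tests are nested tuples: ("a", sub_test) runs action a then sub_test,
# a list [t1, ..., tn] is replication (independent copies from the same state),
# and () is the empty test, which always succeeds.
LMP = {
    "s0": {"a": [(0.2, "s1"), (0.5, "s2")]},
    "s1": {"b": [(0.7, "s3")]},
    "s2": {}, "s3": {},
}

def success_prob(model, state, test):
    if test == ():
        return 1.0
    if isinstance(test, list):                       # replication
        prob = 1.0
        for sub in test:
            prob *= success_prob(model, state, sub)
        return prob
    action, rest = test                              # ("a", sub_test)
    return sum(p * success_prob(model, nxt, rest)
               for p, nxt in model.get(state, {}).get(action, []))

# t = a.(b,b): run a, then two independent copies of b -> 0.098 for this model.
print(success_prob(LMP, "s0", ("a", [("b", ()), ("b", ())])))
```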

16

New Equivalence Notion: "By-Level Equivalence"

[Diagram: two processes P and Q, with transitions such as a[1/3], a[2/3], b[1/3], c[2/3].]

17

K-Moment Equivalence

1-moment (trace): t ::= ε | a.t
2-moment: t ::= ε | a^k.t, k ≤ 2
3-moment: t ::= ε | a^k.t, k ≤ 3

For a trace and an action a, consider the random variable X whose possible values are the probabilities p_i with which a is accepted in the state reached after the trace: Pr(X = p_i) is the probability of performing the trace and making a transition to a state that accepts action a with probability p_i.

Two systems are "by-level" equivalent when E(X^k) is equal for both systems.

Recall: the kth moment of X is E(X^k) = Σ_i x_i^k · Pr(X = x_i).
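A tiny sketch of that formula, for a discrete random variable given as (value, probability) pairs (the example distribution is made up):

```python
def kth_moment(distribution, k):
    """E(X^k) for a discrete X given as a list of (value, probability) pairs."""
    return sum(value ** k * prob for value, prob in distribution)

# Example: X takes value 0.7 with probability 0.2 and value 0.0 with probability 0.8.
X = [(0.7, 0.2), (0.0, 0.8)]
print(kth_moment(X, 1), kth_moment(X, 2))   # first and second moments
```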

18

Ready Equivalence and Failure Equivalence

1. Ready Equivalence

Two systems are Ready equivalent iff for any trace tr and any set of actions A, they have the same probability of successfully running tr and reaching a process that accepts all actions in A.

[Diagram: two processes P and Q, with transitions such as a[1/3], a[2/3], a[1/4], a[3/4], b[1/2], and c.]

Example: for (⟨a⟩, {b, c}), P gives probability 2/3 while Q gives 1/2.

Test grammar: t ::= ε | a.t | {a1, ..., an}

2. Failure Equivalence

Two systems are Failure equivalent iff for any trace tr and any set of actions A, they have the same probability of successfully running tr and reaching a process that refuses all actions in A.

[Diagram: the same processes P and Q.]

Example: for (⟨a⟩, {b, c}), P gives probability 1/3 while Q gives 1/2.

Test grammar: t ::= ε | a.t | {a1, ..., an}
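An illustrative sketch of the ready-set probability on a known toy LMP (not the slides' P and Q); for failure equivalence the end state would instead have to refuse every action in A:

```python
LMP = {
    "q0": {"a": [(0.5, "q1"), (0.5, "q2")]},
    "q1": {"b": [(1.0, "q3")], "c": [(1.0, "q4")]},
    "q2": {"b": [(1.0, "q3")]},
    "q3": {}, "q4": {},
}

def ready_prob(model, state, trace, ready_set):
    """Probability of running `trace` and ending where every action in ready_set is enabled."""
    if not trace:
        enabled = set(model.get(state, {}))
        return 1.0 if ready_set <= enabled else 0.0
    head, rest = trace[0], trace[1:]
    return sum(p * ready_prob(model, nxt, rest, ready_set)
               for p, nxt in model.get(state, {}).get(head, []))

print(ready_prob(LMP, "q0", "a", {"b", "c"}))   # 0.5 for this toy model
```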

19

1. Barb Acceptance

[Diagram: the same processes P and Q as on the previous slide.]

Barb equivalence

Example (barb acceptance): the barb (⟨a, b⟩, ⟨{a, b}, {b, c}⟩) has probability 2/3.

2. Barb Refusal

[Diagram: the same processes P and Q.]

Example (barb refusal): the barb (⟨a, b⟩, ⟨{b, c}, {b, c}⟩) has probability 1/3.

Test grammar: t ::= ε | a.t | {a1, ..., an}.a.t

20

Outline

Program Verification Problem

The Approach for trace-equivalence

Other equivalences

Application on MDPs

Conclusion

21

Application on MDPs

[Diagram: two MDPs (MDP 1 and MDP 2) over states s0-s9, with actions a, b, c, transition probabilities such as 0.8, 0.2, 0.3, 0.7, 0.5, 0.9, and rewards r1-r8.]

Case 1: the reward space contains 2 values (binary): 0 and 1
Case 2: the reward space is small (discrete): {r1, r2, r3, r4, r5}
Case 3: the reward space is very large (continuous): w.l.o.g. [0, 1]

22

Application on MDPs

Case 1: the reward space contains 2 values (binary). Reward 0 is treated as a failure (F) and reward 1 as a success (S).

Case 2: the reward space is small (discrete), {r1, r2, r3, r4, r5}. Each action a is split into one variant per reward value (a_r1, ..., a_r5, and likewise b_r1, ..., b_r5); the observed reward then determines the success (S) or failure (F) of the chosen variant.

Case 3: the reward space is very large (continuous).

Intuition: a reward of r = 3/4 behaves like reward 1 with probability 3/4 and reward 0 with probability 1/4.

After executing action a and receiving reward r, pick a reward value ranVal uniformly at random; if ranVal < r the step counts as a success (S), and if ranVal ≥ r it counts as a failure (F).
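A short sketch of this binarization (the threshold direction follows the intuition above; names are made up):

```python
import random

def binarize_reward(r):
    """Turn a continuous reward r in [0, 1] into S with probability exactly r."""
    ran_val = random.random()          # uniform in [0, 1)
    return "S" if ran_val < r else "F"

# Intuition check: r = 3/4 should yield S about 75% of the time.
samples = [binarize_reward(0.75) for _ in range(100000)]
print(samples.count("S") / len(samples))   # ~0.75
```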

23

Current and Future Work

Application to different equivalence notions: Failure equivalence, Ready equivalence, Barb equivalence, etc.

Experimental analysis on realistic systems

Applying the approach to compute the divergence between HMMs, POMDPs, and probabilistic automata

Studying the properties of the divergence
