Novel Function Approximation Techniques for Large-scale Reinforcement Learning. A Dissertation by Cheng Wu, submitted to the Graduate School of Engineering in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Computer Engineering. Advisor: Prof. Waleed Meleis. Northeastern University, Boston, Massachusetts, April 2010.
values shown represent the final converged value of the solution rate. The results indicate
that the fraction of test instances solved increased from 73.3% to 96.4% for the 8x8 grid,
from 32.3% to 88.8% for the 16x16 grid, and from 20.8% to 76.4% for the 32x32 grid as the
number of prototypes increases.
By comparing with Table 2.2, we see that adaptive Kanerva Coding achieves a lower
average solution rate when solving hard test instances than when solving easy test instances
when the number of prototypes and the size of the grid are held constant.
Figure 3.1 shows the average fraction of hard test instances solved by Q-learning with
adaptive Kanerva Coding with 2000 prototypes as the size of the grid varies from 8x8 to 32x32. The
graph shows how the solvers converge as the number of epochs increases. The results show
that when using adaptive Kanerva-based function approximation with 2000 prototypes, the
fraction of test instances solved decreases from 94.9% to 67.9% as the grid size increases.
These results indicate that although it improves on traditional Kanerva Coding, the
fraction of test instances solved using adaptive Kanerva Coding still decreases sharply as
CHAPTER 3. FUZZY LOGIC-BASED FUNCTION APPROXIMATION 40
[Figure 3.1: plot of average solution rate (0.0 to 1.0) versus epoch (0 to 2000) for (1) the 8x8 grid, (2) the 16x16 grid, and (3) the 32x32 grid.]
Figure 3.1: The fraction of easy and hard test instances solved by Q-learning with adaptive Kanerva Coding with 2000 prototypes.
the size of the grid increases when applied to hard test instances. Feature optimization improves the efficiency of function approximation only to a certain extent, and cannot solve hard instances of large-scale systems. We must therefore explore other factors that may be causing the poor performance.
3.2 Prototype Collisions in Kanerva Coding
Kanerva Coding is an implementation of sparse distributed memory (SDM) for reinforcement learning. A collection of
k prototypes is selected, each of which corresponds to a binary feature. A state-action pair
sa and a prototype pi are adjacent if their bit-wise representations differ by no more than a
threshold number of bits. The threshold is typically set to 1 bit. We define the adjacency
grade adji(sa) of sa with respect to pi to be equal to 1 if sa is adjacent to pi, and 0
otherwise. A state-action pair’s prototype vector consists of its adjacency grades with
[Figure 3.2: three panels, each showing state-action pairs sa1 and sa2 in the state-action space SA relative to four prototypes.]
Figure 3.2: The illustration of prototype collision. (a) adjacent to no prototype; (b) adjacent to an identical prototype set; (c) adjacent to unique prototype vectors.
respect to all prototypes. A value θ(i) is maintained for the ith prototype, and Q(sa), an
approximation of the value of a state-action pair sa, is then the sum of the θ-values of the
adjacent prototypes; that is,
Q(sa) = Σi θ(i) · adji(sa).
A prototype collision is said to have taken place between two distinct state-action pairs,
sai and saj, if and only if sai and saj have the same prototype vector, that is, the same
adjacency grades over all prototypes.
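The definitions above can be sketched in a few lines of Python. This is an illustrative sketch, not the dissertation's implementation: the integer bit-string encoding, function names, and example prototypes are all assumptions.

```python
# Illustrative sketch of Kanerva Coding's adjacency test, prototype vectors,
# Q-value estimate, and collision check. State-action pairs and prototypes
# are encoded as integers whose bits form the bit-wise representation.

def bit_difference(a, b):
    """Hamming distance between the bit-wise representations of a and b."""
    return bin(a ^ b).count("1")

def prototype_vector(sa, prototypes, threshold=1):
    """Adjacency grades adj_i(sa): 1 if sa differs from p_i by at most
    `threshold` bits, 0 otherwise."""
    return tuple(int(bit_difference(sa, p) <= threshold) for p in prototypes)

def q_value(sa, prototypes, theta, threshold=1):
    """Q(sa) = sum_i theta(i) * adj_i(sa): sum of theta-values of adjacent prototypes."""
    return sum(t * a for t, a in zip(theta, prototype_vector(sa, prototypes, threshold)))

def collides(sa1, sa2, prototypes, threshold=1):
    """A prototype collision: two distinct pairs share the same prototype vector."""
    return prototype_vector(sa1, prototypes, threshold) == \
           prototype_vector(sa2, prototypes, threshold)

prototypes = [0b0000, 0b1111]
theta = [0.5, -0.2]
print(prototype_vector(0b0001, prototypes))   # (1, 0): adjacent to 0b0000 only
print(collides(0b0001, 0b0010, prototypes))   # True: both pairs map to (1, 0)
```

Here 0b0001 and 0b0010 each differ from 0b0000 by one bit, so both have the prototype vector (1, 0) and therefore collide, even though they are distinct pairs.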
In Kanerva Coding, for two arbitrary state-action pairs, there are three possible cases: the
state-action pairs are both adjacent to no prototypes, the state-action pairs have identical
prototype vectors, or the state-action pairs have distinct prototype vectors, as shown in
Figure 3.2. Kanerva Coding works best when each state-action pair has a unique prototype
vector, so that no prototype collisions take place. If prototypes are not well distributed across
the state-action space, many state-action pairs will either not be adjacent to any prototypes,
or adjacent to identical sets of prototypes, corresponding to identical prototype vectors. If
two similar state-action pairs are adjacent to the same set of prototypes, their state-action
values are always the same during the learning process. Typically, the solver needs to distinguish
such state-action pairs, which is not possible in this case. Such prototype collisions reduce
the quality of the results, since the estimated Q-values of such state-action pairs will always be
equal [49].
The collision rate in Kanerva Coding is the fraction of state-action pairs that are
either adjacent to no prototypes, or adjacent to the same set of prototypes as some other
state-action pair. The larger the collision rate, the more frequently prototype collisions
occur during Kanerva-based function approximation. The collision rate is therefore inversely
related to the learning performance of a reinforcement learner with Kanerva-based function
approximation.
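This definition of the collision rate can be sketched directly. The function below is a hypothetical illustration, reusing the same integer encoding and bit-difference threshold conventions as above, not the dissertation's measurement code.

```python
from collections import Counter

# Hypothetical sketch of the collision-rate measurement: the fraction of
# observed state-action pairs that are adjacent to no prototype, or that
# share a prototype vector with some other observed pair.

def collision_rate(pairs, prototypes, threshold=1):
    def vector(sa):
        return tuple(int(bin(sa ^ p).count("1") <= threshold) for p in prototypes)
    vectors = [vector(sa) for sa in pairs]
    counts = Counter(vectors)
    zero = (0,) * len(prototypes)
    colliding = sum(1 for v in vectors if v == zero or counts[v] > 1)
    return colliding / len(pairs)

# 0b0001 and 0b0010 share a prototype vector, 0b0110 is adjacent to no
# prototype, and 0b1110 has a unique vector: 3 of the 4 pairs collide.
print(collision_rate([0b0001, 0b0010, 0b0110, 0b1110], [0b0000, 0b1111]))  # 0.75
```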
Selecting a set of prototypes that distinguishes frequently-visited distinct state-action pairs
can improve the solver’s ability to solve the problem. However, it is difficult to generate such a
set of prototypes: the space of possible subsets is very large, and the state-action
pairs encountered by the solver depend on the specific problem instance being solved.
Dynamic prototype allocation and adaptation removes unnecessary prototypes and adds
new prototypes that cover parts of the state-action space that are frequently visited during
instance-based learning. In this way, prototypes can be adaptively adjusted to minimize
prototype collisions for the specific problem domain.
[Figure 3.3: stacked bars showing the percentage of state-action pairs adjacent to a unique prototype set, a non-unique prototype set, and no prototypes, for traditional and adaptive Kanerva Coding on the 8x8, 16x16, and 32x32 grids.]
Figure 3.3: Prototype collisions using traditional and adaptive Kanerva-based function approximation with 2000 prototypes.
In order to evaluate the negative effect of prototype collisions, we observe the collision
rates produced when using traditional Kanerva Coding and adaptive Kanerva Coding as the
size of the grid varies. Figure 3.3 shows the fraction of state-action pairs that are adjacent
to no prototypes, adjacent to identical sets of prototypes, and adjacent to a unique set
of prototypes when traditional Kanerva Coding and adaptive Kanerva Coding with 2000
prototypes are applied to easy predator-prey instances of varying sizes. Here, the collision
rate is the sum of the fraction of state-action pairs that are adjacent to no prototypes and the
fraction of state-action pairs that are adjacent to identical sets of prototypes. These results
show that for the traditional algorithm, the collision rate increases from 25.0% to 71.5% as
the size of the grid increases. For the adaptive algorithm, the collision rate increases from
8.5% to 29.5% as the size of the grid increases.
The results also suggest that the improved performance of the adaptive Kanerva algorithm
over the traditional algorithm is due to the reduction of prototype collisions. For
example, for a grid size of 32x32, the adaptive Kanerva algorithm reduces the collision rate
from 71.5% to 29.5% while increasing the average solution rate.
However, the results also indicate that while the adaptive mechanism successfully reduces
the number of collisions caused by the fraction of state-action pairs that are adjacent to no
prototypes, it is not as successful at reducing the number of collisions caused by the fraction
of state-action pairs that are adjacent to identical sets of prototypes. For example, the
adaptive algorithm reduces the number of collisions caused by state-action pairs that are
adjacent to no prototypes by 91.7% in the 8x8 grid, by 90.0% in the 16x16 grid, and by
78.8% in the 32x32 grid. But it reduces the collisions caused by state-action pairs that are
adjacent to identical sets of prototypes by only 49.4% in the 8x8 grid, 45.6% in the 16x16
grid, and 28.0% in the 32x32 grid.
To further clarify the effect of prototype collisions on the efficiency of Kanerva-based
function approximation, we evaluate the performance of traditional and adaptive Kanerva-
based function approximation, and their corresponding collision rates, using different numbers
of prototypes and different grid sizes. Figure 3.4 shows the fraction of test instances solved
(solution rate), and the fraction of state-action pairs that are adjacent to no prototypes or
adjacent to identical prototype vectors (collision rate), for traditional and adaptive Kanerva-
based function approximation as the number of prototypes varies from 300 to 2500 on
grids of varying sizes from 8x8 to 32x32.
[Figure 3.4: six panels. Panels (a), (c), and (e) plot the solution rate versus the number of prototypes (300 to 2500) for traditional and adaptive Kanerva Coding on the 8x8, 16x16, and 32x32 grids; panels (b), (d), and (f) plot the corresponding breakdown of state-action pairs adjacent to a unique prototype set, an identical prototype set, and no prototype.]
Figure 3.4: Average fraction of test instances solved (solution rate): (a) 8x8 grid; (c) 16x16 grid; (e) 32x32 grid; and the fraction of state-action pairs that are adjacent to no prototypes or adjacent to identical prototype vectors (collision rate): (b) 8x8 grid; (d) 16x16 grid; (f) 32x32 grid, using traditional and adaptive Kanerva-based function approximation as the number of prototypes varies from 300 to 2500.
The values shown represent the final converged value of the solution rate. The results
show that, when using traditional Kanerva Coding, as the number of prototypes increases,
the solution rate increases from 57.2% to 93.5% while the collision rate decreases from 83.7%
to 22.9% for the 8x8 grid; the solution rate increases from 28.5% to 82.3% while the collision
rate decreases from 65.7% to 48.4% for the 16x16 grid; and the solution rate increases from
7.9% to 43.2% while the collision rate decreases from 89.7% to 70.8% for the 32x32 grid. In
comparison, when using adaptive Kanerva Coding, the solution rate increases from 81.3%
to 99.5% while the collision rate decreases from 50.0% to 8.8% for the 8x8 grid; the solution
rate increases from 49.6% to 96.1% while the collision rate decreases from 65.7% to 16.2% for
the 16x16 grid; and the solution rate increases from 23.3% to 92.4% while the collision rate
decreases from 84.9% to 25.4% for the 32x32 grid.
These results indicate that the fraction of test instances solved falls sharply as the
collision rate rises, for both traditional and adaptive Kanerva-based function approximation,
across all numbers of prototypes and grid sizes. The results also indicate that adaptive
Kanerva Coding achieves better learning performance and causes fewer prototype collisions
than traditional Kanerva Coding, and that this advantage grows as the size of the grid
increases.
However, the performance of the adaptive algorithm on large instances is still poor as the
number of prototypes decreases, as shown in Figure 3.4. It is therefore necessary to consider
a more effective approach for reducing the collision rate as the dimension of the state-action
space increases.
3.3 Adaptive Fuzzy Kanerva Coding
A more flexible and powerful approach to function approximation is to allow a state-action
pair to update the θ-values of all prototypes, instead of only a subset of neighboring prototypes.
Instead of binary adjacency values, we use fuzzy membership grades that vary continuously
between 0 and 1 across all prototypes. Such fuzzy membership grades are larger for closer prototypes
and smaller for more distant prototypes. Since prototype collisions occur only when two
state-action pairs have the same real values in all elements of their membership vectors,
collisions are less likely.
In traditional Kanerva Coding, a collection of k prototypes is selected. A state-
action pair sa and a prototype pi are said to be adjacent if their bit-wise representations
differ by no more than a threshold number of bits. To introduce fuzzy membership grades,
we reformulate this definition of traditional Kanerva Coding using fuzzy logic [16, 13, 44].
We define the membership grade µi(sa) of sa with respect to pi as

µi(sa) = 1 if sa is adjacent to pi, and 0 otherwise.
A state-action pair’s membership vector consists of its membership grades with respect to
all prototypes. A value θ(i) is maintained for the ith feature, and Q̂(sa), an approximation of
the value of a state-action pair sa, is then the sum of the θ-values of the adjacent prototypes.
That is,

Q̂(sa) = Σi θ(i) µi(sa).
Figure 3.5: Sample membership function for traditional Kanerva Coding.
Therefore Kanerva Coding can greatly reduce the size of the value table that needs to be
stored.
Figure 3.5 gives an abstract description of the distribution of a state-action pair’s mem-
bership grade with respect to each element of a set of prototypes. The figure shows the
regions of the state-action space where prototype collisions take place. Note that receptive
fields with crisp boundaries can cause frequent collisions.
3.3.1 Fuzzy and Adaptive Mechanism
In our fuzzy approach to Kanerva Coding, the membership grade is defined as follows. Given
a state-action pair sa, the ith prototype pi, and a constant variance σ², the membership grade
Figure 3.6: Sample membership function for fuzzy Kanerva Coding.
of sa with respect to pi is

µi(sa) = exp(−||sa − pi||² / (2σ²)),

where ||sa − pi|| represents the bit difference between sa and pi. Note that the membership
grade of a state-action pair with respect to an identical prototype is 1, and the membership
grade of a state-action pair with respect to a completely different prototype approaches 0.
The effect of an update ∆θ to a prototype’s θ-value is now a continuous function of the
bit difference ||sa − pi|| between the state-action pair sa and the prototype pi. The update can
have a large effect on immediately adjacent prototypes, and a smaller effect on more distant
prototypes. Figure 3.6 gives an abstract description of the distribution of a state-action pair’s
fuzzy membership grade with respect to each member of a set of prototypes.
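The Gaussian membership grade can be sketched as follows. This is a hedged illustration: the value of σ and the helper names are assumptions, and the bit difference reuses the integer encoding from the earlier sketches.

```python
import math

# Sketch of the fuzzy membership grade mu_i(sa) = exp(-d^2 / (2 sigma^2)),
# where d is the bit difference between sa and prototype p_i. The default
# sigma below is an illustrative choice, not a value from the dissertation.

def membership(sa, p, sigma=1.5):
    d = bin(sa ^ p).count("1")          # bit difference ||sa - p||
    return math.exp(-d * d / (2.0 * sigma * sigma))

def fuzzy_q(sa, prototypes, theta, sigma=1.5):
    """Q(sa) = sum_i theta(i) * mu_i(sa): every prototype now contributes."""
    return sum(t * membership(sa, p, sigma) for t, p in zip(theta, prototypes))

print(membership(0b0000, 0b0000))                               # identical: grade 1.0
print(membership(0b0000, 0b0001) > membership(0b0000, 0b0111))  # closer: larger grade
```

Because two state-action pairs must match in every real-valued element of their membership vectors to collide, exact collisions become far less likely than with binary adjacency grades.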
In the adaptive Kanerva Coding algorithm described above, prototypes are updated based
on their visit frequencies. In fuzzy Kanerva Coding the visit frequency of each prototype is
identical, so we instead use the membership grades, which vary continuously from 0 to 1. If the
membership grade of a state-action pair with respect to a prototype tends to 1, we say that
the prototype is strongly adjacent to the state-action pair. Otherwise, the prototype is said
to be weakly adjacent to the state-action pair. The probability pupdate(sa) that a state-action
pair sa is chosen as a prototype is

pupdate(sa) = λ e^(−λ m(sa)),

where λ is a parameter that can vary from 0 to 1, and where m(sa) is the sum of the
membership grades of state-action pair sa with respect to all prototypes. In this mechanism,
prototypes that are weakly adjacent to frequently-visited state-action pairs tend to
be probabilistically replaced by prototypes that are strongly adjacent to frequently-visited
state-action pairs.
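The replacement rule above can be sketched as follows. This is a hypothetical illustration: the helper names, the default λ and σ, and the promotion helper are assumptions layered on the Gaussian membership grade.

```python
import math
import random

# Sketch of the probabilistic update rule p_update(sa) = lambda * exp(-lambda * m(sa)),
# where m(sa) sums sa's fuzzy membership grades over all prototypes. A pair that
# is only weakly covered by the current prototypes (small m) is the likeliest
# candidate to become a prototype itself.

def membership_sum(sa, prototypes, sigma=1.5):
    return sum(math.exp(-bin(sa ^ p).count("1") ** 2 / (2.0 * sigma ** 2))
               for p in prototypes)

def p_update(sa, prototypes, lam=0.5, sigma=1.5):
    return lam * math.exp(-lam * membership_sum(sa, prototypes, sigma))

def maybe_promote(sa, prototypes, theta, lam=0.5, rng=random):
    """Promote sa to a prototype with probability p_update(sa)."""
    if rng.random() < p_update(sa, prototypes, lam):
        prototypes.append(sa)
        theta.append(0.0)

# A pair far from every prototype is more likely to be promoted than one
# sitting exactly on a prototype.
print(p_update(0b1111111, [0b0000000]) > p_update(0b0000000, [0b0000000]))  # True
```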
Figure 4.3: The fraction of equivalence classes that contain two or more state-action pairs over all equivalence classes (the conflict rate), and the corresponding solution rate and collision rate, using traditional Kanerva and adaptive Kanerva with frequency-based prototype optimization across all sizes of grids.
Figure 4.3 shows the fraction of equivalence classes that contain two or more state-
action pairs over all equivalence classes, the conflict rate, and its corresponding solution rate
and collision rate using traditional Kanerva and adaptive Kanerva with frequency-based
prototype optimization across all sizes of grids. These results show that as the fraction of
equivalence classes that contain two or more state-action pairs increases, the collision rate
increases and the performance of each algorithm decreases. For example, for the traditional
algorithm, the collision rate increases from 25.0% to 71.5% and the average solution rate
decreases from 93.1% to 40.6%, while the fraction of equivalence classes that contain two
or more state-action pairs increases from 27.6% to 79.5% as the size of the grid increases.
CHAPTER 4. ROUGH SETS-BASED FUNCTION APPROXIMATION 70
For the adaptive algorithm, the collision rate increases from 8.5% to 29.5% and the average
solution rate decreases from 99.5% to 90.5%, while the fraction of equivalence classes that
contain two or more state-action pairs increases from 8.2% to 35.2% as the size of the grid
increases.
The results also demonstrate that the improved performance of the adaptive Kanerva
algorithm over the traditional algorithm is due to the reduction of the fraction of equivalence
classes that contain two or more state-action pairs. For example, the adaptive algorithm
reduces the fraction of equivalence classes that contain two or more state-action pairs from
79.5% to 35.2% while the average solution rate for the adaptive algorithm increases from
40.6% to 90.5% for a grid size of 32x32.
Figure 4.4 shows the fraction of prototypes remaining after performing a prototype reduct
using traditional and optimized Kanerva-based function approximation with 2000 proto-
types. The original and final number of prototypes is shown on each bar. The results
indicate that the structure of the equivalence classes can be maintained using fewer
prototypes. For example, the equivalence classes induced by 1821 prototypes for the adaptive
algorithm using frequency-based prototype optimization are the same as the equivalence classes
induced by 2000 prototypes for a grid size of 32x32.
4.3 Rough Sets-based Kanerva Coding
A more reliable approach to prototype optimization for function approximation is to apply
rough sets theory to reformulate Kanerva-based function approximation. Instead of using
[Figure 4.4: bars showing the percentage of original prototypes in the reduct for traditional and adaptive Kanerva Coding on the 8x8, 16x16, and 32x32 grids, with bar labels 821/2000, 943/2000, 1368/2000, 1476/2000, 1753/2000, and 1821/2000.]
Figure 4.4: The fraction of prototypes remaining after performing a prototype reduct using traditional and optimized Kanerva-based function approximation with 2000 prototypes. The original and final number of prototypes is shown on each bar.
visit frequencies for frequency-based prototype optimization, we focus on the structure of
equivalence classes induced by the set of prototypes, a key indicator of the efficiency of
function approximation. When the fraction of equivalence classes that contain two or more
state-action pairs increases, the performance of a reinforcement learner based on Kanerva
coding decreases. Since a prototype reduct maintains the equivalence class structure, pro-
totype deletion can be conducted by replacing the set of prototypes with a reduct of the
original prototype set. Since prototype collisions occur only when two state-action pairs are
in the same equivalence class, prototype generation should reduce the fraction of equivalence
classes that contain two or more state-action pairs.
4.3.1 Prototype Deletion and Generation
In rough sets-based Kanerva coding, if the structure of equivalence classes remains un-
changed, the efficiency of function approximation is also unchanged. Replacing a set of
prototypes with its reduct clearly eliminates unnecessary prototypes. We therefore imple-
ment prototype deletion by finding a reduct R of original prototype set P . We refer to this
approach as reduct-based prototype deletion.
Note that a reduct of a prototype set is not necessarily unique, and there may be many
subsets of prototypes which preserve the equivalence-class structure. The following algorithm
finds a reduct of the original prototype set. We consider each prototype in P one by one. For
prototype p ∈ P, if the set of equivalence classes {E_{P−{p}}} induced by P − {p} is not identical
to the set of equivalence classes {E_P} induced by P, that is, {E_{P−{p}}} ≠ {E_P}, then p belongs to a reduct
R of the original prototype set P, p ∈ R; otherwise, p is not in the reduct R, p ∉ R, and we
delete p from the prototype set P. We then consider the next prototype. The final set R is a reduct
of the original prototype set P. We find a series of random reducts of the original prototype set,
then select a reduct with the fewest elements to replace the original prototype
set. Reduct-based prototype optimization makes only a few passes through the prototypes
and is not time-consuming. With n state-action pairs and p prototypes, the complexity is
O(n · p²). Once a prototype is deleted, its θ-value is added to the
nearest remaining prototypes.
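The reduct computation can be sketched as follows. This is an illustrative, single-pass version under assumed conventions (integer-encoded pairs, the bit-difference adjacency threshold from Chapter 3); the function names are hypothetical.

```python
# Sketch of reduct-based prototype deletion. Observed state-action pairs are
# partitioned into equivalence classes by their prototype vectors; a reduct is
# a subset of prototypes that induces the same partition.

def induced_classes(pairs, prototypes, threshold=1):
    """Partition the observed pairs by their adjacency (prototype) vectors."""
    part = {}
    for sa in pairs:
        key = tuple(int(bin(sa ^ p).count("1") <= threshold) for p in prototypes)
        part.setdefault(key, set()).add(sa)
    return frozenset(frozenset(c) for c in part.values())

def find_reduct(pairs, prototypes, threshold=1):
    """Drop, one by one, each prototype whose removal leaves the partition unchanged."""
    target = induced_classes(pairs, prototypes, threshold)
    reduct = list(prototypes)
    i = 0
    while i < len(reduct):
        trial = reduct[:i] + reduct[i + 1:]
        if induced_classes(pairs, trial, threshold) == target:
            reduct = trial      # unnecessary prototype: delete and retry this position
        else:
            i += 1              # necessary prototype: keep it in the reduct
    return reduct

# The duplicate prototype and 0b0000 itself are redundant here: 0b1111 alone
# induces the same two classes {0b0001, 0b0010} and {0b1110}.
print(find_reduct([0b0001, 0b0010, 0b1110], [0b0000, 0b0000, 0b1111]))  # [15]
```

The dissertation's algorithm repeats this over ten randomized orderings and keeps the smallest reduct found; a single deterministic pass is shown here for brevity.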
In rough sets-based Kanerva coding, if the number of equivalence classes that contain
only one state-action pair increases, prototype collisions are less likely and the efficiency of
function approximation increases. An equivalence class that contains two or more state-
action pairs is likely to be split up by adding a new prototype equal to one of those state-
action pairs. We therefore implement prototype generation by adding new prototypes that
split equivalence classes with two or more state-action pairs. We refer to this approach as
equivalence class-based prototype generation.
For an arbitrary equivalence class that contains n > 1 state-action pairs, we randomly
select ⌈log(n)⌉ state-action pairs to be new prototypes. Note that this value is the smallest
number of prototypes needed to distinguish all state-action pairs in an equivalence class that
contains n elements. This algorithm does not guarantee that each equivalence class will be
split into new classes that contain exactly one state-action pair. For example, this approach
cannot split an equivalence class containing two neighboring state-action pairs. In this case, we
add a new prototype that is a neighbor of one state-action pair, but not a neighbor of
the other.
4.3.2 Rough Sets-based Kanerva Coding Algorithm
Algorithm 2 describes our algorithm for implementing Q-learning with adaptive Kanerva
coding using rough sets-based prototype optimization. The algorithm begins by initializing
parameters, and repeatedly executes Q-learning with adaptive Kanerva Coding. Prototypes
are adaptively updated periodically. In each update period, the encountered state-action
pairs are recorded. To update prototypes, the algorithm first determines the structure of
the equivalence classes of the set of encountered state-action pairs with respect to original
Algorithm 2 Pseudo-code of Q-learning with rough sets-based Kanerva Coding

Main()
    choose a set of prototypes ~p and initialize their ~θ values
    repeat
        generate the initial state-action pair sa from initial state ς and action a
        Q-with-Kanerva(sa, a, ~p, ~θ)
        Update-prototypes(~p, ~θ)
    until all episodes are traversed

Q-with-Kanerva(sa, a, ~p, ~θ)
    repeat
        take action a, observe reward r, and get next state ς′
        Q(sa) = Σi θ(i) · adji(sa)
        for all actions a∗ available in the new state ς′ do
            generate the state-action pair sa′ from state ς′ and action a∗
            Q(sa′) = Σi θ(i) · adji(sa′)
        end for
        δ = r + γ · max Q(sa′) − Q(sa)
        ∆~θ = α · δ
        ~θ = ~θ + ∆~θ
        if random probability ≤ ε then
            for all actions a∗ available in the current state ς do
                Q(sa) = Σi θ(i) · adji(sa)
            end for
            a = argmax_a Q(sa)
        else
            a = random action
        end if
    until ς is terminal

Update-prototypes(~p, ~θ)
    Prototype-reduct-based-Deletion(~p, ~θ)
    Equivalence-class-based-Generation(~p, ~θ)

Prototype-reduct-based-Deletion(~p, ~θ)
    E(~p) = equivalence classes induced by ~p
    ~p_reduct = ~p
    for i = 1 to 10 do
        ~p_tmp = ~p
        repeat
            ~p̂ = ~p_tmp − {p}
            E(~p̂) = equivalence classes induced by ~p̂
            if E(~p̂) = E(~p) then
                ~p_tmp = ~p̂
            end if
        until all prototypes p ∈ ~p_tmp are traversed
        if |~p_reduct| > |~p_tmp| then
            ~p_reduct = ~p_tmp
        end if
    end for

Equivalence-class-based-Generation(~p, ~θ)
    repeat
        n = size of equivalence class E
        if n > 1 then
            if n = 2 and the two state-action pairs sa1 and sa2 are neighbors then
                ~p = ~p ∪ {p | p is a neighbor of sa1 but not a neighbor of sa2}
            else
                repeat
                    randomly select a state-action pair sa from E
                    ~p = ~p ∪ {sa}
                until ⌈log(n)⌉ new prototypes are generated
            end if
        end if
    until all equivalence classes E ∈ E(~p) are traversed
prototypes. Unnecessary prototypes are then deleted by replacing the original prototype set
with a reduct with the fewest elements among ten randomly-generated reducts. In order to
split large equivalence classes, new prototypes are randomly selected from these equivalence
classes. For the equivalence classes with two neighboring state-action pairs, a new prototype
is a neighbor of one state-action pair, but not a neighbor of the other. The optimized
prototype set is constructed by adding the newly generated prototypes to the reduct of the
original prototype set.
4.3.3 Performance Evaluation of Rough Sets-based Kanerva Cod-
ing
We evaluate the performance of rough sets-based Kanerva coding by using it to solve pursuit
instances on grids of varying sizes. As a comparison, traditional Kanerva coding and adaptive
Kanerva coding with different numbers of prototypes are also applied to the same instances.
Traditional Kanerva coding follows Sutton [38]. Kanerva coding with adaptive prototype
optimization is implemented using prototype deletion and prototype splitting. A detailed
description of prototype deletion and splitting can be found in Section 3.3. When rough
sets-based Kanerva coding is implemented during a learning process, we also observe the
change in the number of prototypes and the fraction of equivalence classes that contain only
one state-action pair.
Figure 4.5 shows the average fraction of test instances solved when traditional Kanerva,
adaptive Kanerva, and rough sets-based Kanerva are applied to our instances with grids of
[Figure 4.5: three panels plotting average solution rate (0.0 to 1.0) versus epoch (0 to 2000). Each panel compares rough set-based optimization (with 568, 955, and 1968 prototypes, respectively), frequency-based optimization with 2000 prototypes, frequency-based optimization with the matching number of prototypes, and no optimization with 2000 prototypes.]
Figure 4.5: Average solution rate for traditional Kanerva, adaptive Kanerva and rough sets-based Kanerva: (a) 8x8 grid; (b) 16x16 grid; (c) 32x32 grid.
Table 4.3: Percentage improved performance of rough sets-based Kanerva over adaptive Kanerva.
size 8x8, 16x16 and 32x32. The results show that the rough sets-based algorithm increases the fraction
of test instances solved over the adaptive Kanerva algorithm when using the same number of
prototypes. For example, after 2000 epochs, the rough sets-based algorithm increases
the fraction of test instances solved over the adaptive algorithm from 87.6% to 99.4% in the
8x8 grid, from 73.4% to 98.0% in the 16x16 grid, and from 81.1% to 92.8% in the 32x32
grid. The results when using grids of varying sizes indicate that rough sets-based
Kanerva coding uses fewer prototypes and achieves higher performance by adaptively
changing the number and allocation of prototypes.
Table 4.3 shows the percentage improved performance using rough sets-based Kanerva
over adaptive Kanerva across varying grid sizes. The results show that the performance
of the rough sets-based approach is consistently more than 10% better than that of the
adaptive approach across different grid sizes. This indicates that our rough sets-based approach
can reliably improve a Kanerva-based reinforcement learner’s performance.
Figure 4.6 shows the effect of our rough sets-based Kanerva coding on the number of
prototypes and the corresponding change in the fraction of equivalence classes that contain
only one state-action pair in grids of size 8x8, 16x16 and 32x32. The results show that the
rough sets-based algorithm reduces the number of prototypes and increases the fraction of
equivalence classes with only one state-action pair. For example, after 2000 epochs, the rough
[Figure 4.6: three panels plotting, versus epoch (0 to 2000), the number of prototypes (falling from 2000 to 568, 955, and 1968, respectively) and the percentage of equivalence classes that contain only one state-action pair.]
Figure 4.6: Effect of rough sets-based Kanerva on the number of prototypes and the fraction of equivalence classes: (a) 8x8 grid; (b) 16x16 grid; (c) 32x32 grid.
sets-based algorithm reduces the number of prototypes to 568, 955 and 1968, and increases the fraction of equivalence classes with one state-action pair to 99.5%, 99.8% and 94.9% in the 8x8, 16x16 and 32x32 grids, respectively. These results also demonstrate that rough sets-based Kanerva Coding can adaptively find an effective number of prototypes and dynamically allocate prototypes to achieve a good equivalence-class structure for a particular application.
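The singleton-class metric plotted in Figure 4.6 can be computed directly from the prototype set. The sketch below, a simplification under our own assumptions (binary state-action encodings and a Hamming-distance adjacency test with a hypothetical threshold), groups state-action pairs by their prototype membership vectors and reports the fraction of classes holding exactly one pair:

```python
from collections import defaultdict

def membership_vector(pair, prototypes, threshold=2):
    """Bit-vector recording which prototypes are 'adjacent' to a state-action
    pair, using Hamming distance as in Kanerva Coding (threshold is assumed)."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return tuple(int(hamming(pair, p) <= threshold) for p in prototypes)

def singleton_fraction(pairs, prototypes):
    """Fraction of equivalence classes that contain exactly one pair.
    Pairs sharing a membership vector collide in the same class."""
    classes = defaultdict(list)
    for pair in pairs:
        classes[membership_vector(pair, prototypes)].append(pair)
    singletons = sum(1 for members in classes.values() if len(members) == 1)
    return singletons / len(classes)
```

A fraction near 1.0 means almost every observed state-action pair is distinguishable by the prototype set, which is the structure the rough sets-based algorithm drives toward.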
4.4 Effect of Varying the Number of Initial Prototypes
The accuracy of Kanerva-based function approximation is sensitive to the number of prototypes. In general, more prototypes are needed to approximate the state-action space of a more complex application, since larger sets of prototypes can approximate complex spaces more accurately. On the other hand, the computational cost of Kanerva Coding also grows with the number of prototypes. Neither traditional nor adaptive Kanerva Coding can adaptively select the number of prototypes, so this number has a significant effect on their efficiency. If the number of prototypes is too large relative to the number of state-action pairs, Kanerva Coding is unnecessarily time-consuming. If the number of prototypes is too small, even well-chosen prototypes yield approximate values far from the true values, and the reinforcement learner gives poor results. Selecting an appropriate number of prototypes is difficult for traditional and adaptive Kanerva Coding, and in most known applications of these algorithms the number of prototypes is
Figure 4.7: Variation in the number of prototypes for different numbers of initial prototypes (0, 250, 500, 1000, 1500 and 2000) with rough sets-based Kanerva Coding in a 16x16 grid. The curves converge to between 922 and 975 prototypes.
selected manually. However, for a particular application, the set of observed state-action pairs is limited to a fixed subset of all possible state-action pairs. The number of prototypes needed to distinguish this set of state-action pairs is therefore also fixed.
We are interested in investigating the effect of different numbers of initial prototypes on rough sets-based Kanerva Coding. We use our rough sets-based algorithm with 0, 250, 500, 1000, 1500 or 2000 initial prototypes to solve pursuit instances in the 16x16 grid. Figure 4.7
shows the effect of our algorithm on the number of prototypes. The results show that the
number of prototypes tends to converge to a fixed number in the range from 922 to 975 after
2000 epochs. The results demonstrate that our rough sets-based Kanerva coding has the
ability to adaptively determine an effective number of prototypes during a learning process.
4.5 Summary
Kanerva Coding can be used to improve the performance of function approximation within
reinforcement learners. This approach often gives poor performance when applied to large-
scale systems. We evaluated a collection of pursuit instances of the predator-prey problem
and argued that the poor performance is caused by inappropriate selection of the prototypes,
including the number and allocation of these prototypes. We also showed that adaptive
Kanerva Coding can give better results by dynamically allocating the prototypes. However, the number of prototypes remains hard to select, and performance was still poor when an inappropriate number of prototypes was used. It was therefore necessary to consider a more effective approach for adaptively selecting the number of prototypes.
Our new rough sets-based Kanerva function approximation uses rough sets theory to reformulate the prototype set and its implementation in Kanerva Coding. This approach uses the structure of equivalence classes to explain how prototype collisions occur. Our algorithm eliminates unnecessary prototypes by replacing the original prototype set with its reduct, and reduces prototype collisions by splitting equivalence classes that contain two or more state-action pairs. Our results indicate that rough sets-based Kanerva Coding can adaptively select an effective number of prototypes and greatly improve a Kanerva-based reinforcement learner's ability to solve large-scale problems.
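The two operations described in this summary, replacing the prototype set with a reduct and splitting colliding equivalence classes, can be sketched as follows. This is a simplified illustration, not the dissertation's actual algorithm: the `adjacent` predicate is caller-supplied, the reduct search is a greedy one-pass deletion, and a colliding class is split by promoting one of its members to a prototype:

```python
from collections import defaultdict

def partition(pairs, prototypes, adjacent):
    """Group state-action pairs into equivalence classes keyed by their
    prototype membership vectors."""
    classes = defaultdict(list)
    for pair in pairs:
        key = tuple(int(adjacent(pair, p)) for p in prototypes)
        classes[key].append(pair)
    return classes

def reduce_and_split(pairs, prototypes, adjacent):
    """One update step: drop prototypes whose removal preserves the partition
    (yielding a reduct), then split one colliding class by promoting one of
    its members to a new prototype."""
    protos = list(prototypes)
    base = len(partition(pairs, protos, adjacent))
    i = 0
    while i < len(protos):
        trial = protos[:i] + protos[i + 1:]
        if trial and len(partition(pairs, trial, adjacent)) == base:
            protos = trial          # this prototype was redundant
        else:
            i += 1
    for members in partition(pairs, protos, adjacent).values():
        if len(members) > 1:        # a collision: 2+ pairs share one class
            protos.append(members[0])
            break
    return protos
```

Deletion shrinks the prototype set without losing any distinctions, while splitting adds discriminating power exactly where pairs still collide.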
Chapter 5
Real-world Application: Cognitive
Radio Network
5.1 Introduction
Radio frequency spectrum is a scarce resource. In many countries, governmental agencies, e.g., the Federal Communications Commission (FCC) in the United States, assign spectrum bands to specific operators or devices and prohibit their use by unlicensed users. However, the use of these assigned bands depends strongly on time and place, and many are rarely used. Recent studies have demonstrated that much of the radio frequency spectrum is inefficiently utilized [35, 5]. To address this issue, the FCC has recently begun to allow unlicensed users to utilize licensed bands whenever doing so would not cause any interference
CHAPTER 5. REAL-WORLD APPLICATION: COGNITIVE RADIO NETWORK 84
[1]. Therefore, dynamic spectrum management techniques are needed to improve the efficiency of spectrum utilization [5, 18, 29]. The development of these techniques motivates the novel research area of cognitive radio (CR) networks.
The key idea of CR networks is that unlicensed devices (also called cognitive radio users) detect vacant spectrum and utilize it without harmful interference to licensed devices (also known as primary users). This approach requires that CR networks have the ability to sense spectrum holes and select the best transmission parameters to meet quality-of-service (QoS) requirements. However, in real-world ad hoc networks, dynamic network topology and spectrum availability that varies across time slots and locations pose a critical challenge for CR networks.
Recent studies have shown that applying theoretical research on multi-agent reinforcement learning to spectrum management in CR networks is a feasible approach for meeting this challenge [50]. Since a CR network must have sufficient computational intelligence to choose appropriate transmission parameters based on the external network environment, it must be capable of learning from its historical experience and adapting its behavior to the current context. This approach works well for small network topologies. However, it often gives poor performance when applied to large-scale networks. These networks typically have a very large number of unlicensed and licensed users, and a wide range of possible transmission parameters. Experimental results have shown that the performance of CR networks decreases sharply as the network size increases [50]. There is therefore a need for algorithms that apply function approximation techniques to scale up reinforcement learning
Figure 5.1: The CR ad hoc network architecture, showing primary networks with primary users (PUs) on licensed bands, and CR users sharing licensed and unlicensed bands.
for large-scale cognitive radio networks.
Our work focuses on cognitive radio ad hoc networks with decentralized control [4]. The
architecture of a CR ad hoc network, shown in Figure 5.1 [50], can be partitioned into two
groups of users: the primary network and the CR network components. The primary network
is composed of primary users (PUs) that have a license to operate in a certain spectrum
band. The CR network is composed of cognitive radio users (CR users) that share wireless
channels with licensed users that already have an assigned spectrum. Under this architecture, the CR users need to continuously monitor the spectrum for the presence of primary users and reconfigure the radio front-end according to the demands and requirements of the higher
layers. This capability can be realized, as shown in Figure 5.2 [50], by the cognitive cycle
Figure 5.2: The cognitive radio cycle for the CR ad hoc architecture, linking spectrum sensing (PU detection, RF stimuli), spectrum decision (spectrum characterization, channel capacity), spectrum sharing (transmitted signal) and spectrum mobility through the radio environment.
composed of the following spectrum functions: (1) determining the portions of the spectrum
currently available (Spectrum sensing), (2) selecting the best available channel (Spectrum
decision), (3) coordinating access to this channel with other users (Spectrum sharing), and
(4) effectively vacating the channel when a licensed user is detected (Spectrum mobility).
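The four spectrum functions above can be arranged into a simple per-slot control loop. The sketch below is purely schematic; the `Radio` class and its method names are hypothetical stand-ins, not part of any actual CR protocol stack:

```python
class Radio:
    """Hypothetical stand-in for a CR user's radio front-end."""
    def __init__(self, channels, pu_active):
        self.channels = channels        # channels this radio can use
        self.pu_active = set(pu_active) # channels currently held by PUs
        self.log = []                   # channels transmitted on, per slot

    def sense(self):                    # (1) spectrum sensing
        return [c for c in self.channels if c not in self.pu_active]

    def decide(self, holes):            # (2) spectrum decision
        return holes[0] if holes else None

    def share(self, channel):           # (3) spectrum sharing
        self.log.append(channel)

    def vacate(self, channel):          # (4) spectrum mobility
        self.pu_active.add(channel)

def cognitive_cycle(radio, slots, pu_appears):
    """One pass through the four spectrum functions per time slot.
    pu_appears(slot, channel) -> bool models a licensed user returning."""
    for t in range(slots):
        holes = radio.sense()
        channel = radio.decide(holes)
        if channel is None:
            continue
        radio.share(channel)
        if pu_appears(t, channel):
            radio.vacate(channel)
```

The loop makes explicit that sensing feeds decision, decision feeds sharing, and mobility is triggered only when a licensed user reappears on the chosen channel.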
In this chapter, we describe a reinforcement learning-based solution that allows each
sender-receiver pair to locally adjust its choice of spectrum and transmit power, subject to
connectivity and interference constraints. We model this as a multi-agent learning system,
where each action, i.e. choice of power level and spectrum, earns a reward based on the
utility that is maximized. We first evaluate the reinforcement learning-based approach, and
show that it works well for small topology networks and performs poorly for large topology
networks. We argue that large-scale cognitive radio wireless networks are typically difficult to
solve using reinforcement learning because of their huge state-action spaces. Thus, using a smaller approximate value table instead of the original state-action value table is necessary for a real cognitive radio wireless network. We then apply function approximation techniques to reduce the size of the state-action value table. We conclude that our function approximation technique can scale up the reinforcement learning-based cognitive radio approach.
5.2 Reinforcement Learning-Based Cognitive Radio
5.2.1 Problem Formulation
In this chapter, we assume that our network consists of a collection of PUs and CR users, each of which is paired with another user to form transmitter-receiver pairs. The PUs exist in a region that spatially overlaps the nodes of the wireless network. Each CR user decides on its choice of spectrum and transmission power independently of the other users in its neighborhood. We also assume perfect sensing, in which a CR user correctly infers the presence of a PU whenever the CR user lies within the PU's transmission range. Moreover, the CR users can also detect, in the case of a collision, whether the colliding node is a PU transmitter or another CR user. We model this by keeping the PU transmit power an order of magnitude higher than the CR user's power, which is realistic in contexts such as TV transmitters. If the receiver, while performing energy detection, observes the received signal energy at a level several multiples greater than in the CR user-only case, it identifies a
collision with the PU, and relays this condition back to the sender via an out-of-band control channel. Since the PU receiver's location is unknown (so it cannot be determined whether a collision occurred at the PU receiver due to the concurrent CR transmission), all such cases are flagged as PU interference. Thus, our approach is conservative: it overestimates the interference to the PU in order to safeguard its performance.
A choice of spectrum by CR user i is essentially the choice of the frequency F^i ∈ ~F, the set of available frequencies. The CR users continuously monitor the spectrum that they choose in each time slot. The channels chosen are discrete, and a jump from any channel to another is possible in consecutive time slots.
The transmit power chosen by CR user i is given by P^i_tx. The transmission range and interference range are represented by R_t and R_i, respectively. Our simulator uses the free-space path loss equation to calculate the attenuated power incident at receiver j, denoted P^j_rx. Thus,

P^j_rx = α · P^i_tx · (D_i)^(−β),

where D_i is the distance between transmitter and receiver, the path loss exponent β = 2, and the speed of light c = 3 × 10^8 m/s. The power values chosen are discrete, and a jump from any given value to another is possible in consecutive time slots.
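The path loss model above translates directly into code. In this sketch, `alpha` is a hypothetical lumped constant absorbing antenna gains and wavelength terms; the simulator's actual constant may differ:

```python
def received_power(p_tx, distance, alpha=1.0, beta=2):
    """Free-space path loss: P_rx = alpha * P_tx * distance**(-beta).
    beta = 2 in free space; alpha is an assumed lumped constant."""
    return alpha * p_tx * distance ** (-beta)

def link_connected(p_tx, distance, rx_threshold):
    """A link holds only if the attenuated power clears the receiver
    threshold (compare the link-disconnection penalty in Section 5.2.2)."""
    return received_power(p_tx, distance) >= rx_threshold
```

With beta = 2, doubling the distance cuts the received power by a factor of four, which is why the sender must raise its transmit power quickly when a link is about to drop.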
5.2.2 Application to cognitive radio
In a cognitive radio network, if we consider each cognitive user to be an agent and the wireless network to be the external environment, cognitive radio can be formulated as a system in
Figure 5.3: Multi-agent reinforcement learning-based cognitive radio: multiple cognitive users apply spectrum sensing, spectrum decision, spectrum sharing and spectrum mobility while interacting with the radio environment.
which communicating agents sense their environment, learn, and adjust their transmission
parameters to maximize their communication performance. This formulation fits well within
the context of reinforcement learning.
Figure 5.3 gives an overview of how we apply reinforcement learning to cognitive radio.
Each cognitive user acts as an agent using reinforcement learning. These agents do spectrum
sensing and perceive their current states, i.e., spectra and transmission powers. They then
make spectrum decisions and use spectrum mobility to choose actions, i.e. switch channels
or change their power value. Finally, the agents use spectrum sharing to transmit signals.
Through interaction with the radio environment, these agents receive transmission rewards
which are used as the inputs for the next sensing and transmission cycle.
A state in reinforcement learning is the information that an agent can perceive within the environment. In RL-based cognitive radio, the state of an agent is the current spectrum and power value of its transmission. The state of the multi-agent system includes the state of every agent. We therefore define the state of the system at time t, denoted s_t, as

s_t = (~F, ~Ptx)_t,

where ~F is a vector of spectra and ~Ptx is a vector of power values across all agents. Here F^i and P^i_tx are the spectrum and power value of the ith agent, with F^i ∈ ~F and P^i_tx ∈ ~Ptx. If there are M spectra and N power values, we can use indices to specify them, so that each F^i ∈ {1, 2, ..., M} and each P^i_tx ∈ {1, 2, ..., N}.
An action in reinforcement learning is the behavior of an agent at a specific time in a specific state. In RL-based cognitive radio, an action allows an agent either to switch from its current spectrum to a new available spectrum in ~F, or to switch from its current power value to a new available power value in ~Ptx. We define the action a_t at time t as

a_t = (~k)_t,

where ~k is a vector of actions across all agents. Here k^i is the action of the ith agent and k^i ∈ {jump spectrum, jump power}.
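The state and action encodings above can be made concrete with a short sketch. The function names and the assertion-based range checks are our own illustrative choices, using the M = 100 spectra and N = 20 power levels of the "real-world" topology:

```python
M, N = 100, 20  # spectra and power levels in the "real-world" topology

def system_state(spectra, powers):
    """s_t = (~F, ~Ptx)_t: per-agent spectrum and power indices."""
    assert all(1 <= f <= M for f in spectra)
    assert all(1 <= p <= N for p in powers)
    return tuple(spectra), tuple(powers)

def apply_action(spectrum, power, action, new_value):
    """In each slot an agent either jumps spectrum or jumps power,
    keeping the other component of its state unchanged."""
    if action == "jump_spectrum":
        return new_value, power
    if action == "jump_power":
        return spectrum, new_value
    raise ValueError("unknown action: %s" % action)
```

Flattening each agent's (spectrum, power) pair over M × N index combinations is what makes the joint state-action space grow so quickly with the number of agents, motivating the function approximation of later sections.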
A reward in reinforcement learning is a measure of the desirability of an agent’s action at
a specific state within the environment. In RL-based cognitive radio, the reward r is closely
Figure 5.4: Comparative reward levels for different observed scenarios: a moderate positive reward for a successful transmission; moderate negative rewards for channel errors and packet collisions; and high negative rewards for PU interference and link disconnection.
related to the performance of the network. The rewards for the following different network
conditions are shown in Figure 5.4 [50]:
• CR-PU interference: If a primary user (PU) transmits signals in the same time slot and in the same spectrum used by the CR users, then we impose a high penalty of −15. The intuition behind this heavy negative reward follows the basic communication principle that the licensed devices' use of their spectrum must be strictly protected.
• Intra-CR network collision: If a packet suffers a collision with another concurrent CR user transmission, then a penalty of −5 is imposed. The intuition behind this light negative reward follows the principle that collisions among the CR users lower the link
throughput and should be avoided. The comparatively low penalty for intra-network collisions aims to promote fair sharing of the available spectrum by encouraging the users to choose distinct spectrum bands, if available.
• Channel-induced errors: If a transmitted packet suffers a channel-induced error, then we impose a penalty of −5. The intuition behind this light negative reward follows the principle that certain spectrum bands are more robust to channel errors owing to their lower attenuation rates. By preferring the spectrum bands with the lowest packet error rate (PER), the CR users reduce re-transmissions and the associated network delays.
• Link disconnection: If the received power P^j_rx is less than the receiver threshold Prth (here assumed to be −85 dBm), then all packets are dropped, and we impose a steep penalty of −20. The sender should then quickly increase its transmit power so that the link can be re-established.
• Successful transmission: If none of the above conditions holds in the given transmission slot, then the packet is successfully transmitted from the sender to the receiver, and a reward of +5 is assigned.
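The five reward levels above can be collected into a single function. This is only a sketch of the assignment as described; the check order (PU interference before link disconnection) is our assumption, since the text does not specify how overlapping conditions are ranked:

```python
def reward(pu_interference, cr_collision, channel_error, p_rx_dbm,
           rx_threshold_dbm=-85):
    """Map one slot's observed conditions to the reward levels in the text."""
    if pu_interference:
        return -15   # CR-PU interference: heavy penalty
    if p_rx_dbm < rx_threshold_dbm:
        return -20   # link disconnection: steepest penalty
    if cr_collision or channel_error:
        return -5    # intra-CR collision or channel-induced error
    return 5         # successful transmission
```

Keeping the disconnection penalty steepest makes the learner restore lost links before optimizing throughput, which matches the intent stated above.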
In this way, we can apply multi-agent reinforcement learning to solve the cognitive radio problem [50].
Figure 5.5: Block diagram of the implemented simulator tool for reinforcement learning-based cognitive radio, comprising the reinforcement learning module (application layer); the CR link layer module (MAC functionality: spectrum sensing, spectrum sharing, spectrum management, neighbor list); the CR physical layer module (PU activity block, spectrum block, Tx power); and a cross-layer repository.
5.3 Experimental Simulation
In this section, we describe preliminary results from applying multi-agent reinforcement
learning to our cognitive radio model. The overall aim of our proposed learning based
approach is to allow the CR users (i.e., agents) to decide on an optimal choice of transmission
power and spectrum so that (i) PUs are not affected, and (ii) CR users share the spectrum
in a fair manner.
5.3.1 Simulation Setup
A novel CR network simulator described in Section 4.1 has been designed to investigate the
effect of the proposed reinforcement learning technique on the network’s operation. As shown
in Figure 5.5, our implemented ns-2 model [50] is composed of several modifications to the
physical, link and network layers in the form of stand-alone C++ modules. The PU Activity
Block describes the activity of PUs based on the on-off model, including their transmission
range, location, and spectrum band of use. The Spectrum Block contains a channel table
with the background noise, capacity, and occupancy status. The Spectrum Sensing Block
implements the energy-based sensing functionality; if a PU is detected, the Spectrum Management Block is notified. This, in turn, causes the device to switch to the next available channel and alerts the upper layers of the change of frequency. The Spectrum Sharing
Block coordinates the distributed channel access, and calculates the interference at any given
node due to the ongoing transmissions in the network. The Cross Layer Repository facilitates
the information sharing between the different protocol stack layers.
We have conducted a simulation study on two topologies: a 3 × 3 grid network with a total of 9 CR users (the small topology), and a random deployment of a varying number of CR users distributed in a square area 1000 m on a side (the "real-world" topology). In the small topology, we assume 4 spectrum bands, given by the set F = {50 MHz, 500 MHz, 2 GHz, 5 GHz}, and 4 transmit power values. There are a total of 2 PUs.
In the "real-world" topology, we assume 100 spectrum bands, chosen in the range from 50 MHz to 5 GHz, and 20 transmit power values, uniformly distributed between 0.5 mW and 4 mW. There are a total of 25 PUs. Each PU is randomly assigned one default channel, in which it stays with probability 0.4. It can also switch to three other pre-chosen, successively placed channels with decreasing probabilities 0.3, 0.2 and 0.1, respectively. Thus, the PU has an underlying distribution with which it is active on a given channel, but this
is unknown to the CR user. Transmission in the CR network occurs on multiple sets of pre-decided node pairs, each such pair forming a link represented as (i, j), where the terms in parentheses denote directional transmission from sender i to receiver j. The choice of spectrum is made by the sender node and is communicated to the receiver over the common control channel (CCC). The CCC is also used to return feedback to the sender regarding possible collisions experienced at the receiver. However, data transmission occurs exclusively in the spectrum chosen by the node pair forming the link. We consider time to be slotted, and the link layer at each sender node attempts to transmit with probability p = 0.2 in every slot.
We compare the performance of our reinforcement learning based (RL-based) scheme
with three other schemes: (i) random assignment, which selects a random combination of
spectrum and power in each round; (ii) greedy assignment with history 1 (G-1), and (iii)
greedy assignment with history 20 (G-20). The G-1 algorithm stores for every possible
spectrum and power combination the reward received the last time that combination was
selected (if any). The algorithm selects the combination with the highest previous reward
with probability η and explores a randomly chosen combination with probability (1−η). The
G-20 algorithm maintains a repository of the rewards obtained in the past 20 slots for every combination of power and spectrum, and selects the best combination observed over those slots. As with G-1, G-20 selects the best known combination from its history with probability η = 0.8, and explores a randomly chosen one with probability (1 − η) = 0.2. In our RL-based scheme, the exploration rate ε is set to 0.2, which we found experimentally to give the best results. The
initial learning rate α is set to 0.8, and it is decreased by a factor of 0.995 after each time
slot. Note that G-1 uses the same amount of memory as the RL-based scheme, but G-20
uses 20 times more memory.
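The G-1 baseline described above can be sketched in a few lines; the function names here are hypothetical, and η = 0.8 follows the text:

```python
import random

def g1_select(last_reward, combos, eta=0.8, rng=random):
    """G-1 baseline: with probability eta exploit the (spectrum, power)
    combination with the best reward seen the last time it was tried;
    otherwise explore a uniformly random combination."""
    if last_reward and rng.random() < eta:
        return max(last_reward, key=last_reward.get)
    return rng.choice(combos)

def g1_update(last_reward, combo, reward):
    """Store only the most recent reward per combination (history of 1)."""
    last_reward[combo] = reward
```

G-20 would replace the single stored value with a length-20 deque per combination, which is exactly where its 20-fold memory cost comes from.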
5.3.2 Simulation Evaluation
We apply the four schemes, i.e., random, G-1, G-20, and RL-based, to the small topology. We collect results over 30,000 time slots, and record the average probability of successful transmission, the average reward of the CR users, and the average number of channel switches by CR users. We then plot these values over time. Each experiment is performed 5 times, and we report the means and standard deviations of the recorded values. In our experiments, all runs were found to converge within 30,000 epochs.
Figure 5.6(a) shows the average probability of successful transmission when applying the four schemes to the small topology. The results show that the RL-based scheme transmits packets successfully with an average probability of approximately 97.5%, while the G-20, G-1 and random schemes do so with average probabilities of approximately 88.2%, 79.4%, and 48.7%, respectively. The results indicate that after learning, the RL-based approach can effectively guarantee successful transmissions, and its performance is much better than that of the others, including the G-20 scheme, which uses more than an order of magnitude more memory.
Figure 5.6(b) shows the corresponding average rewards received by CR users when ap-
plying the four schemes to the small topology. The results show that after learning, the