Citation: Fernández-Conde, J.; Cuenca-Jiménez, P.; Cañas, J.M. Hybrid Training Strategies: Improving Performance of Temporal Difference Learning in Board Games. Appl. Sci. 2022, 12, 2854. https://doi.org/10.3390/app12062854

Academic Editors: Yujin Lim and Hideyuki Takahashi

Received: 3 February 2022
Accepted: 8 March 2022
Published: 10 March 2022

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

applied sciences

Article

Hybrid Training Strategies: Improving Performance of Temporal Difference Learning in Board Games †

Jesús Fernández-Conde * , Pedro Cuenca-Jiménez and José M. Cañas

Department of Telematic Systems and Computation, Rey Juan Carlos University, Fuenlabrada, 28942 Madrid, Spain; [email protected] (P.C.-J.); [email protected] (J.M.C.)
* Correspondence: [email protected]
† This submission is an extension of the conference paper: Fernández-Conde, J.; Cuenca-Jiménez, P.; Cañas, J.M. (2020) An Efficient Training Strategy for a Temporal Difference Learning Based Tic-Tac-Toe Automatic Player. In: Smys, S.; Bestak, R.; Rocha, Á. (eds) Inventive Computation Technologies, International Conference on Inventive Computation Technologies ICICIT 2019. Lecture Notes in Networks and Systems, vol. 98. Springer, Cham. https://doi.org/10.1007/978-3-030-33846-6_47

Abstract: Temporal difference (TD) learning is a well-known approach for training automated players in board games with a limited number of potential states through autonomous play. Because of its directness, TD learning has become widespread, but certain critical difficulties must be solved in order for it to be effective. It is impractical to train an artificial intelligence (AI) agent against a random player since it takes millions of games for the agent to learn to play intelligently. Training the agent against a methodical player, on the other hand, is not an option owing to a lack of exploration. This article describes and examines a variety of hybrid training procedures for a TD-based automated player that combines randomness with specified plays in a predetermined ratio. We provide simulation results for the famous tic-tac-toe and Connect-4 board games, in which one of the studied training strategies significantly surpasses the other options. On average, it takes fewer than 100,000 games of training for an agent taught using this approach to act as a flawless player in tic-tac-toe.

Keywords: reinforcement learning; temporal difference learning; automatic player; board games; hybrid training; tic-tac-toe; Connect-4

1. Introduction

Reinforcement learning (RL) is the study of how artificial intelligence (AI) agents may learn what to do in a particular environment without having access to labeled examples [1,2]. Without any prior knowledge of the environment or the reward function, RL employs perception and observed rewards to develop an optimal (or near-optimal) policy for the environment. RL is a classic AI problem in which an agent is placed in an unfamiliar environment and must learn to act successfully in it.

The RL agent learns to play board games by obtaining feedback (reward, reinforcement) after the conclusion of each game [3], knowing that something good has happened after winning or something bad has happened after losing. This agent acts without having any prior knowledge of the appropriate strategies to win the game.

It is difficult for a person to make precise and consistent assessments of a large number of positions while playing a board game, as would be required if an evaluation function were learned directly from examples. An RL agent, on the other hand, can learn an evaluation function that yields relatively accurate estimations of the likelihood of winning from any given position by knowing only whether it won or lost each game played.

Due to its simplicity and low computational requirements, temporal difference learning (TD), a type of reinforcement learning (RL), has been widely used to successfully train automatic game players in board games such as checkers [4], backgammon [5–7], tic-tac-toe [8], Chung Toi [9,10], go [11,12], or Othello [13]. The TD agent models a fully observable environment using a state-based representation.


The agent’s policy is static, and its job is to figure out what the utility values of the various states are. The TD agent is unaware of the transition model, which provides the likelihood of arriving at a given state from another state after completing a particular action, as well as the reward function, which specifies the reward for each state.

The updates performed by TD learning do not need the use of a transition model. The environment, in the form of witnessed transitions, provides the link between nearby states. The TD technique aims to make local modifications to the utility estimates so that each state “agrees” with its successors.

TD modifies a state to agree just with its observed successor (as defined by the TD update rule equation [14]), rather than altering it to agree with all possible successors, weighted by their probability. When the effects of TD modifications are averaged across a large number of transitions, this approximation turns out to be acceptable, since the frequency of any successor in the set of transitions is approximately proportional to its probability. For back-propagating the evaluations of successive positions to the present position, TD learning utilizes the difference between two successive positions [15]. Because this is done for all positions that occur in a game, the game’s outcome is included in the evaluation function of all positions, and as a result, the evaluation function improves over time.

The exploration–exploitation conundrum [16] is the most important question to overcome in TD learning: if an agent has devised a good line of action, should it continue to do so (exploiting what it has discovered), or should it explore to find better actions? To learn how to act better in its surroundings, an agent should immerse itself in it as much as feasible. As a result of this, several past studies have shown disheartening results, requiring millions of games before the agent begins to play intelligently.

Coevolutionary TD learning [17] and n-tuple systems [18,19] have been presented as extensions to the basic TD learning method. Self-play was used to train a TD learning agent for the game Connect-4 in [20]. The time-to-learn was determined to be 1.5 million games (with an 80% success rate).

Other expansions and changes to the original TD method have been proposed, including the use of TD in conjunction with deep learning and neural networks [21]. These studies show promising outcomes at the risk of overlooking the two major benefits of TD learning: its relative simplicity and low computational costs.

This article presents and evaluates several hybrid techniques to train a TD learning agent in board games, extending the work presented in [22]. Exploration concerns are solved by the incorporation of hybridness into the training strategies: pure random games are included in a certain proportion in every training strategy considered, and they are interspersed evenly during the training process.

The TD agent evaluated in this study employs the original TD method without any modifications, keeping the TD implementation’s simplicity and minimal processing requirements. One of the training alternatives considered proves to outperform all the others, needing fewer than 100,000 training games on average to behave like a perfect player in the classic tic-tac-toe board game, achieving this degree in all simulation runs. In addition, the same technique has been tested in the Connect-4 board game, also yielding a significant performance improvement.

The rest of this article is arranged as follows: In Section 2, we analyze the key properties of the TD agent in order to learn how to play the famous board games tic-tac-toe and Connect-4. The experiments that were carried out and the results that were obtained are described in Section 3. In Section 4, the results are interpreted in depth. Finally, Section 5 summarizes the major findings of this study.


2. Materials and Methods

2.1. Tic-Tac-Toe Board Game

Tic-tac-toe [23] is a traditional two-player (X and O) board game in which players alternate marking places on a 3 × 3 grid. The game is won by the person who can line up three of their markings in a horizontal, vertical, or diagonal row. Figure 1 shows a typical tic-tac-toe game, which was won by the first player (X).


Figure 1. A finished game of tic-tac-toe, won by player X.

At first impression, it appears that the first player has an advantage (he or she can occupy one spot more than his or her opponent at the end of the game). Because of the intrinsic asymmetry of the tic-tac-toe game, the player that starts the game (player X) has almost twice as many winning final positions as the second player (player O).

A game between two flawless players, on the other hand, has been proven to always result in a tie. Furthermore, no matter how strongly the O-player tries to win, a perfect X-player can select any square as the opening move and still tie the game.

The tic-tac-toe game’s state space is moderately narrow, with 3⁹ (=19,683) potential states, albeit most of these are illegal owing to game constraints. The number of attainable states is 5478, according to combinatorial analysis.
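The 5478 figure can be reproduced with a short enumeration over all 3⁹ candidate boards, keeping only those compatible with alternating play: the X count equals or exceeds the O count by at most one, and a player holding a completed line must have been the last to move. The C++ sketch below encodes these legality rules; it is purely illustrative and is not part of the authors' implementation.

#include <array>
#include <iostream>

// Returns true if 'mark' (1 = X, 2 = O) has three in a row on board b.
static bool hasLine(const std::array<int, 9>& b, int mark) {
    static const int lines[8][3] = {{0,1,2},{3,4,5},{6,7,8},{0,3,6},
                                    {1,4,7},{2,5,8},{0,4,8},{2,4,6}};
    for (const auto& l : lines)
        if (b[l[0]] == mark && b[l[1]] == mark && b[l[2]] == mark)
            return true;
    return false;
}

int main() {
    int legal = 0;
    for (int code = 0; code < 19683; ++code) {   // 3^9 candidate boards
        std::array<int, 9> b{};
        int c = code, x = 0, o = 0;
        for (int i = 0; i < 9; ++i) {
            b[i] = c % 3;
            c /= 3;
            if (b[i] == 1) ++x;
            else if (b[i] == 2) ++o;
        }
        bool xWins = hasLine(b, 1), oWins = hasLine(b, 2);
        // X moves first, so x == o or x == o + 1; a winner must have moved last.
        bool reachable = (x == o || x == o + 1)
                         && !(xWins && x != o + 1)
                         && !(oWins && x != o);
        if (reachable) ++legal;
    }
    std::cout << "Attainable tic-tac-toe states: " << legal << '\n';  // expected: 5478
    return 0;
}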

Some basic strategies to play the game intelligently are detailed in [24], combining flexibility to adapt to changing situations and stability to satisfy long-term goals.

2.2. Connect-4 Board Game

Connect-4 is a popular board game for two players, played on a board with several rows and columns. One main characteristic of the game is the vertical arrangement of the board: both players in turn drop one of their pieces into one of the columns (slots). Due to the gravity rule, the pieces fall to the lowest free position of the respective slot. The goal of both opponents is to create a line of four pieces of their own color, either horizontally, vertically, or diagonally. The player who achieves this first wins the game. If all the board positions become filled and neither of the opponents was able to create a line of four own pieces, the match ends in a tie.

In Figure 2, we can observe a typical Connect-4 game for a 4 × 4 (four rows, four columns) board, won by the first player (blue):


Figure 2. A finished game of Connect-4 on a 4 × 4 board, won by the blue player.

In 1988, two independent research works [25,26] demonstrated that Connect-4 with the standard board size 7 × 6 (seven columns and six rows) is a first-player win: if the first player makes the right moves, the opponent is bound to lose the game, regardless of the opponent’s moves. For other board sizes (e.g., 4 × 4, 5 × 4, 5 × 5, 6 × 5), it has been proved [27] that Connect-4 presents the same characteristic as tic-tac-toe: a game between two perfect players always ends up in a draw.

The Connect-4 game’s state space is considerably larger than tic-tac-toe’s. Even for the smallest board size (four rows and four columns), the number of potential states adds up to 3¹⁶ (=43,046,721), thus being 7858 times the potential states present in tic-tac-toe. An upper-bound estimation of the number of legal states falls within the range (180,000, 200,000); that is, its state-space complexity is larger than tic-tac-toe’s approximately by a factor of 30.

2.3. TD-Learning Agent

In order to examine alternative training procedures, a TD learning agent has been developed in the C++ language for each board game considered (tic-tac-toe and 4 × 4 Connect-4). The generated C++ programs, which include the various training techniques assessed, only account for ~400 noncomment source lines (NCSLs) of code each, due to the algorithm’s straightforwardness.

Regular tic-tac-toe and Connect-4 game rules are assumed in our implementation of both TD agents. We will assume that the opponent of the TD agent always plays first.

The distinct game states are indicated by an n-digit number with potential digits 0 (an empty position), 1 (a position occupied by the opponent), and 2 (a position held by the TD agent). In theory, this means that there are 3ⁿ potential states, but owing to game rules, some of them are not attainable. The TD agent seeks to travel from a given state to the state with the highest estimated utility of all the attainable states in each game. It will perform a random move if all of the attainable states have the same estimated utility.
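In other words, a board can be packed into a single base-3 index, which makes it convenient to store the utility estimates in a flat array of size 3ⁿ. The helper below is a minimal C++ sketch of such an encoding; the name encodeState and the array layout are illustrative assumptions, not taken from the authors' code.

#include <vector>

// Encodes a board of n cells (0 = empty, 1 = opponent, 2 = TD agent)
// as a base-3 index in the range [0, 3^n); earlier cells become the
// more significant base-3 digits.
int encodeState(const std::vector<int>& board) {
    int index = 0;
    for (int cell : board)
        index = index * 3 + cell;
    return index;
}

// Example usage for tic-tac-toe (n = 9, 3^9 = 19,683 entries):
//   std::vector<double> U(19683, 0.0);
//   U[encodeState(board)] = 0.2;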

After playing an elevated number of games against an opponent (training phase), these utilities represent an estimate of the probability of winning from every state. These utilities symbolize the states’ values, and the whole set of utilities is the learned value function. State A has a higher value than state B if the current estimate of the probability of winning from A is higher than it is from B.

For every potential state of the game, utilities are initially set to zero. When each game is over, the TD agent updates its utility estimates for all of the states that were traveled throughout the game as follows: The utility for the final state of the game (terminal state) will be −1 (opponent won), +1 (TD agent won), or 0.2 (game tied up).


Due to the unique nature of tic-tac-toe and 4 × 4 Connect-4, a small positive reward is assigned to ties, as the game will end in a tie if played intelligently. Utilities are updated backward for the rest of the states visited throughout the game, by updating the current value of the earlier state to be closer to the value of the later state.

If we let s denote the state before a move in the game and s′ denote the state after that move, then we apply the following update to the estimated utility (U) of state s:

U(s) ← U(s) + α [γ U(s′) − U(s)]    (1)

where α is the learning rate and γ is the discount factor parameter. Because this update rule uses the difference in utilities between successive states, it is often called the temporal difference, or TD, equation. After many training games, the TD agent converges to accurate estimates of the probabilities of winning from every state of the game.
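As a concrete illustration (using the parameter values adopted later in the experiments, α = 0.25 and γ = 1), consider a state s whose current estimate is U(s) = 0 and whose observed successor s′ is a terminal state won by the TD agent, so U(s′) = +1. A single application of Equation (1) gives

U(s) ← 0 + 0.25 [1 · 1 − 0] = 0.25

and further wins passing through s keep pulling the estimate toward +1 (0.25, 0.4375, 0.578, ...), whereas losses pull it back toward −1, so U(s) settles near a recency-weighted average of the outcomes observed from s.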

The pseudocode detailed in Box 1 performs the update of the utilities for all the states visited throughout a game, after the game has finished, using the TD Equation (1).

Box 1. Update of utilities after each game, using the TD update rule equation.

// U: utilities for all possible states
// seq: sequence of states visited in the last game
// n_moves: number of moves in the last game

if (TD_agent_won) {
    U[current_state] = 1.0;
} else if (opponent_won) {
    U[current_state] = -1.0;
} else { // tied up
    U[current_state] = 0.2;
}

// Walk the visited states backward, applying the TD update rule equation.
// ALPHA -> learning rate; DF -> discount factor
for (i = 1; i <= n_moves; i++) {
    U[seq[n_moves - i]] += ALPHA * (DF * U[current_state] - U[seq[n_moves - i]]);
    current_state = seq[n_moves - i];
}


As an example, in Figure 3 we can see the utilities estimated by the tic-tac-toe TD agent (X-player) after having played an elevated number of training games, for all reachable states in three different games, given three distinct initial positions occupied by the opponent (O-player).


Figure 3. Estimated utilities by the tic-tac-toe TD agent (X-player) for all reachable states in 3 different games, given 3 distinct initial positions occupied by the opponent (O-player).

2.4. Training Strategies

The TD agent will be taught by playing games against one of the following opponents:

1. Attack player. The player will first determine whether there is an imminent chance of winning and if so, will act accordingly. In all other cases, it will act randomly.

2. Defense player. If the TD agent has an imminent chance to win, the player will obstruct it. It will act randomly in any other situation.

3. Attack–Defense player. It integrates the previous two players, giving precedence to the attack feature. This player will try to win first; if it fails, it will check to see whether the TD agent will win in the following move and if so, it will obstruct it. Otherwise, it will act randomly.

4. Defense–Attack player. It integrates the players (2) and (1), giving precedence to the defense feature. This player will first determine whether the TD agent has a chance to win the next move and if so, will obstruct it. If this is not the case, it will attempt to win if feasible. Otherwise, it will act randomly.

5. Combined player. This player will behave as one of the above players, chosen randomly for every game played.

In order to deal with exploratory issues, hybridness has been incorporated in all of the above learning agents in the following manner: in a certain proportion of games, each player will “forget” about its strategy and will behave as a completely random player. These games will be interspersed throughout the entire training process.
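A minimal C++ sketch of how such a hybrid opponent can be organized is given below, using the Attack–Defense rules for tic-tac-toe as an example. The function names are illustrative assumptions, and the 1-out-of-7 period reflects the hybridness degree used in the experiments of Section 3; this is a sketch, not the authors' implementation.

#include <cstdlib>
#include <vector>

// Board: 9 cells, 0 = empty, 1 = this opponent player, 2 = TD agent.
static const int LINES[8][3] = {{0,1,2},{3,4,5},{6,7,8},{0,3,6},
                                {1,4,7},{2,5,8},{0,4,8},{2,4,6}};

// Returns a cell that completes a line for 'mark', or -1 if none exists.
static int completingMove(const std::vector<int>& b, int mark) {
    for (const auto& l : LINES) {
        int count = 0, empty = -1;
        for (int j = 0; j < 3; ++j) {
            if (b[l[j]] == mark) ++count;
            else if (b[l[j]] == 0) empty = l[j];
        }
        if (count == 2 && empty != -1) return empty;
    }
    return -1;
}

// Uniformly random legal move (assumes at least one free cell remains).
static int randomMove(const std::vector<int>& b) {
    std::vector<int> free_cells;
    for (int i = 0; i < 9; ++i)
        if (b[i] == 0) free_cells.push_back(i);
    return free_cells[std::rand() % free_cells.size()];
}

// Hybrid Attack-Defense opponent: 1 out of every 7 games is played fully at
// random; otherwise try to win, then try to block the TD agent, else play randomly.
int attackDefenseMove(const std::vector<int>& b, int game_index) {
    const int HYBRID_PERIOD = 7;
    if (game_index % HYBRID_PERIOD == 0) return randomMove(b);
    int m = completingMove(b, 1);   // attack: complete an own line
    if (m != -1) return m;
    m = completingMove(b, 2);       // defense: block the TD agent's imminent win
    if (m != -1) return m;
    return randomMove(b);
}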

For the sake of comparison, the Random player (always completely random moves) has also been studied, to have a baseline strategy. It is worth noting that the ideal player was not featured as a training option. The perfect player, by definition, will make the best decision possible in each scenario to win the game (or at least tie if the game cannot be won). As a result, its lack of randomization renders it completely unsuitable for training the TD learning agent.

3. Experiments and Results

Following the learning techniques and rules described in the previous section, we created two TD learning agents, one for the tic-tac-toe game and another for the 4 × 4 Connect-4 game. The characteristics of the experiments that were conducted are as follows:

• The opponent utilized to train the TD agent is always the first player to move. It will represent one of the six training methods described in the preceding section (namely Attack, Defense, Attack–Defense, Defense–Attack, Combined, and Random). Each game’s opening move will always be random.


• The maximum number of training games is set to 2 million. Nevertheless, the training process finishes if there are no games lost during the last 50,000.
• In the TD update rule equation, the value of the learning rate α parameter is set to 0.25 and the value of the discount factor γ parameter is set to 1.
• For every potential state of the game, utilities are initially set to zero in each game.
• The degree of hybridness (random games introduced evenly in all the TD learning agents) is 1 out of 7 (14.286%).
• Following the completion of the training procedure, the TD agent will participate in a series of test games. The values of the utilities assessed by the TD agent during the training phase are constant in these test games.
• The number of test games against a fully random player is fixed to 1 million.
• In the tic-tac-toe game, the number of test games against an expert player is fixed to nine games, as the expert player’s moves are invariant for each game situation. Therefore, the only variable parameter is the first move, which can be made in any of the grid’s nine positions. For the same reason, in the 4 × 4 Connect-4 game, the number of test games against an expert player is fixed to four games, corresponding to the four columns of the board.
• The expert player has been implemented using the Minimax recursive algorithm, which chooses an optimal move for a player assuming that the opponent is also playing optimally [28].
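For reference, the sketch below shows the shape of such a Minimax evaluation for tic-tac-toe, scoring positions +1/0/−1 from the expert's point of view. It is a generic textbook formulation with illustrative function names, not the authors' exact implementation.

#include <algorithm>
#include <vector>

// 0 = empty, 1 = expert player (maximizing), 2 = TD agent (minimizing).
static bool wins(const std::vector<int>& b, int mark) {
    static const int L[8][3] = {{0,1,2},{3,4,5},{6,7,8},{0,3,6},
                                {1,4,7},{2,5,8},{0,4,8},{2,4,6}};
    for (const auto& l : L)
        if (b[l[0]] == mark && b[l[1]] == mark && b[l[2]] == mark) return true;
    return false;
}

// Value of the position with 'player' to move, assuming optimal play on both
// sides: +1 expert win, 0 tie, -1 expert loss.
static int minimax(std::vector<int>& b, int player) {
    if (wins(b, 1)) return +1;
    if (wins(b, 2)) return -1;
    bool full = true;
    int best = (player == 1) ? -2 : +2;
    for (int i = 0; i < 9; ++i) {
        if (b[i] != 0) continue;
        full = false;
        b[i] = player;
        int v = minimax(b, 3 - player);
        b[i] = 0;
        best = (player == 1) ? std::max(best, v) : std::min(best, v);
    }
    return full ? 0 : best;
}

// Expert move: the cell with the highest Minimax value for the expert (player 1).
int expertMove(std::vector<int>& b) {
    int bestMove = -1, bestVal = -2;
    for (int i = 0; i < 9; ++i) {
        if (b[i] != 0) continue;
        b[i] = 1;
        int v = minimax(b, 2);
        b[i] = 0;
        if (v > bestVal) { bestVal = v; bestMove = i; }
    }
    return bestMove;
}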

All the results have been calculated as the average of 100 different simulation runs. The confidence intervals have been computed by applying the unbiased bootstrap method. As an example, Figure 4 shows the tic-tac-toe TD agent loss ratio distribution histograms and 95% confidence intervals after a specific number of training games have been played against the Attack–Defense player.
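As an illustration of this kind of computation, a basic percentile bootstrap over the per-run loss ratios could look as follows. This is a simplified sketch of a 95% interval around the mean; it is not necessarily the exact bootstrap variant used by the authors.

#include <algorithm>
#include <random>
#include <utility>
#include <vector>

// Percentile-bootstrap 95% confidence interval for the mean of 'samples'
// (e.g., the loss ratios observed in the 100 independent simulation runs).
std::pair<double, double> bootstrapCI(const std::vector<double>& samples,
                                      int resamples = 10000) {
    std::mt19937 gen(12345);                        // fixed seed for reproducibility
    std::uniform_int_distribution<size_t> pick(0, samples.size() - 1);
    std::vector<double> means;
    means.reserve(resamples);
    for (int r = 0; r < resamples; ++r) {
        double sum = 0.0;
        for (size_t i = 0; i < samples.size(); ++i)
            sum += samples[pick(gen)];              // resample with replacement
        means.push_back(sum / samples.size());
    }
    std::sort(means.begin(), means.end());
    return { means[static_cast<size_t>(0.025 * resamples)],
             means[static_cast<size_t>(0.975 * resamples)] };
}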


Figure 4. Tic-tac-toe TD agent loss ratio distribution histograms and 95% confidence intervals after a specific number of training games (100, 500, 1000, 5000, 10,000, 20,000, 50,000, 100,000, and 200,000) have been played against the Attack–Defense player.


3.1. Tic-Tac-Toe Training Phase Results

The outcomes of the simulations during the training phase are shown in Figures 5–7. Each figure has six curves: one for each of the five training options listed in Section 2.4, and an additional curve for the Random player.


Figure 5. Tic-tac-toe TD agent loss ratio during the training phase as a function of the number of games played, for the different training strategies.


Figure 6. Tic-tac-toe TD agent loss ratio after 100,000 games as a function of the number of games played, for the different training strategies.


Figure 7. Games lost by the tic-tac-toe TD agent during the training phase as a function of the number of games played, for the different training strategies.

The loss ratio for the TD agent when trained using different opponents can be seen in Figure 5, for the first million training games (after that period, values are practically stable). The loss ratio is calculated by dividing the total number of games lost by the total number of games played, inside a moving 10,000-game window.
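Such a sliding-window ratio can be maintained incrementally during training. The small C++ class below is an illustrative sketch assuming a 10,000-game circular buffer; it is not taken from the authors' code.

#include <vector>

// Maintains the TD agent's loss ratio over the last 'window' games played.
class MovingLossRatio {
public:
    explicit MovingLossRatio(int window = 10000)
        : window_(window), results_(window, 0), losses_(0), games_(0) {}

    // Call once per finished game; 'lost' is true if the TD agent lost it.
    void record(bool lost) {
        int slot = games_ % window_;
        if (games_ >= window_) losses_ -= results_[slot];  // drop the oldest result
        results_[slot] = lost ? 1 : 0;
        losses_ += results_[slot];
        ++games_;
    }

    // Fraction of games lost within the current window.
    double ratio() const {
        int n = games_ < window_ ? games_ : window_;
        return n == 0 ? 0.0 : static_cast<double>(losses_) / n;
    }

private:
    int window_;
    std::vector<int> results_;
    int losses_;
    int games_;
};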

Figure 6 details an enlargement of part of Figure 5, showing the convergence of the TD learning agent for the different training strategies after 100,000 games.

Figure 7 illustrates the total number of games lost by the TD agent during the training phase, for the different training strategies.

3.2. Tic-Tac-Toe Test Phase Results

Figure 8 depicts the percentage of games lost by the TD agent in the test phase against a fully random player (after finishing the training procedure). It should be noted that in the test phase, the TD agent played 1 million games versus the random player.


Figure 8. Percentage of test games won/lost/tied by the tic-tac-toe TD agent playing against a random player for the different training strategies.


Figure 9 represents the percentage of games the TD agent lost/tied versus an expert player. As previously stated, only nine games are played in this situation, since the expert player’s movements are invariable.


Figure 9. Percentage of test games won/lost/tied by the tic-tac-toe TD agent playing against an expert player for the different training strategies.

3.3. Connect-4 Training Phase Results

In Figures 10–12, the training phase results for the 4 × 4 Connect-4 TD learning agent are presented.


Figure 10. Connect-4 TD agent loss ratio during the training phase as a function of the number of games played, for the different training strategies.


Figure 11. Connect-4 TD agent loss ratio after 100,000 games as a function of the number of games played, for the different training strategies.


Figure 12. Games lost by the Connect-4 TD agent during the training phase as a function of the number of games played, for the different training strategies.

3.4. Connect-4 Test Phase Results

In Figures 13 and 14, we present the test phase results for the 4 × 4 Connect-4 TD learning agent.


Figure 13. Percentage of test games won/lost/tied by the Connect-4 TD agent playing against a random player for the different training strategies.


Figure 14. Percentage of test games won/lost/tied by the Connect-4 TD agent playing against an expert player for the different training strategies.

3.5. Summary of Results

Table 1 shows quantitative results of the execution of the experiments on a personal computer with the following characteristics: Intel Core i5-10210U CPU (1.60–2.11 GHz), 16 GB RAM, Windows 10 64-bit operating system.

Table 1. Quantitative results in the training phase for the different strategies (average measures).

Strategy            Tic-Tac-Toe Training Games    Tic-Tac-Toe Time (s)    Connect-4 Training Games    Connect-4 Time (s)
Attack–Defense      99,100                        1.32                    949,300                     18.48
Random              2,000,000                     13.84                   1,780,700                   27.45
Defense–Attack      2,000,000                     30.68                   1,017,400                   24.26
Attack              1,776,300                     19.63                   1,461,200                   28.9
Defense             2,000,000                     26.35                   1,248,700                   24.39
Combined            1,184,200                     13.79                   1,022,500                   21.17

Table 2 details the percentage of simulation runs in which the TD agent becomes the perfect player (zero test games lost against both the random and expert players).

Table 2. Percentage of simulation runs in which the TD agent becomes the perfect player.

Strategy            Tic-Tac-Toe TD Agent Becomes Perfect Player    Connect-4 TD Agent Becomes Perfect Player
Attack–Defense      100% of runs                                   35% of runs
Random              0% of runs                                     15% of runs
Defense–Attack      0% of runs                                     10% of runs
Attack              19% of runs                                    28% of runs
Defense             0% of runs                                     11% of runs
Combined            50% of runs                                    34% of runs

Table 3 shows the number of test games lost by the TD agent against a random player, out of a million test games.


Table 3. Number of test games lost by the TD agent against a random player.

Strategy            Tic-Tac-Toe TD Agent Test Games Lost Against Random Player    Connect-4 TD Agent Test Games Lost Against Random Player
Attack–Defense      0                                                             175.78
Random              14,400.85                                                     1,196.59
Defense–Attack      15,883.17                                                     367.06
Attack              4,435.57                                                      734.06
Defense             19,805.04                                                     448.16
Combined            1425                                                          199.37

Table 4 indicates the number of test games lost by the TD agent against an expert player, out of nine test games for tic-tac-toe and four games for Connect-4.

Table 4. Number of test games lost by the TD agent against an expert player.

Strategy            Tic-Tac-Toe TD Agent Test Games Lost Against Expert Player    Connect-4 TD Agent Test Games Lost Against Expert Player
Attack–Defense      0                                                             0
Random              3.96                                                          0.05
Defense–Attack      2.86                                                          0
Attack              1.61                                                          0.03
Defense             3.19                                                          0.02
Combined            0.65                                                          0.01

4. Discussion

4.1. Loss Ratio during the Training Phase

In Figures 5 and 10, we observe that, despite starting with a higher loss ratio during the first games of the training process, the Attack–Defense strategy presents a faster convergence than the other alternatives considered. The rate of improvement is particularly high between 500 and 5000 games played, as indicated by the slope of the log curve in that range. As in most of the strategies, the rate of improvement undergoes a plateau after about 5000 games played, but the Attack–Defense strategy consistently achieved zero losses at about 100,000 games played (see Figures 6 and 11). This is 2–4 orders of magnitude faster than the competing strategies, where the plateaus are flatter, and it would take millions of games to achieve zero losses.

Regarding the Combined strategy, it has better performance (as measured by the loss ratio) than Attack–Defense during the first 20,000 games played. However, its flatter plateau learning phase makes it keep a residual loss ratio for hundreds of thousands of games after it is surpassed by Attack–Defense.

In general, having a better loss ratio at the beginning of training is not necessarily beneficial, because it might indicate a partial absence of unpredictability in the opponent. It is essential to remark that the number of games lost during the test phase is the most relevant metric. In fact, as seen in Figures 5 and 10, the Attack–Defense approach yields a higher loss ratio than the other techniques in the early stages of the training (left part of the figures, between 100 and 1000 games). However, as we will point out below, Attack–Defense exhibits better behavior throughout the test phase.

Observing the Attack–Defense strategy, when compared to the Random one, we can notice that the loss ratio is clearly worse for the Attack–Defense strategy in the initial stages of training, but after the crossover point (approximately after 2000 games), the loss ratio for the Attack–Defense strategy converges to zero, while the convergence for the Random strategy is much slower.


4.2. Games Lost during the Training Phase

As shown in Figures 7 and 12, in terms of the number of games lost during the training phase, the strategies Random and Attack lose games at a higher rate than the other strategies. In contrast, the number of games lost grows at a slower pace for the other strategies (all of which include opponent defense tactics).

The Random strategy loses fewer games during the first 2000 games, but it keeps losing games long after 2 million games have been played. The marginal contribution of a vastly increased training phase is therefore very small in this strategy.

In Figure 7 (tic-tac-toe game), it is noted that after around 100,000 games the Attack–Defense strategy entirely stops losing games. The Attack–Defense approach is able to learn perfect play, while the other strategies inch slowly toward progress but keep losing games for millions of iterations. None of the other curves show signs of approaching a flat number of losses after 2 million games have been played.

On the other hand, in Figure 12 (Connect-4 game), the Attack–Defense strategy seems to have the best convergence (along with Defense–Attack), but the curve is not completely flat.

4.3. Games Lost during the Test Phase

Figures 8, 9, 13, and 14 show the performance of the different strategies in the test phase, after training is considered complete (Table 1 details the number of training games needed for each strategy to achieve this performance), in which the TD agent is confronted with a random player and an expert player.

In Figures 8 and 9 (tic-tac-toe game), we observe how Random and Defense methods score poorly against both random and expert opponents, in terms of the number of games lost during the test phase. Attack, Attack–Defense, and Combined approaches, which include the “Attack” feature in the first place, perform substantially better against the expert player. Attack–Defense does not have as many wins as other strategies (Random, Attack) against the random player, although it yields perfect performance against the expert player. All other strategies incur losses—sometimes significant—against the expert player, despite having been trained for a much larger number of games (as seen in Table 1). An interesting observation is that the Combined strategy has slightly worse performance than Attack–Defense when playing against random and expert opponents. This relates to the sensitivity of the balance between exploration and exploitation mentioned in the first section of this document.

In Figures 13 and 14 (Connect-4 game), it can be perceived that the differences among strategies are more subtle. The reason is that, because of the nature of the Connect-4 game, only four possible options are available in the first move (as opposed to eight possibilities in tic-tac-toe), and consequently it is easier for the TD agent to discover the best moves, regardless of the strategy used. Nevertheless, Attack–Defense remains the best strategy in terms of the number of games lost, both against the random and the expert players (this can be verified by checking Tables 3 and 4).

4.4. The Attack–Defense Strategy

When playing against both random and expert opponents, the Attack–Defense approach outperforms all other proposed techniques in terms of the number of test games lost by the TD agent. Table 3 shows that out of 1 million games played against a random opponent, the TD agent trained using this approach did not lose a single tic-tac-toe game and lost only 0.0175% of the Connect-4 games played.

In addition, in Table 4 it is noticed that Attack–Defense is the only strategy capable of losing zero games against an expert player in all simulation runs, for both the tic-tac-toe and Connect-4 games.

There is another important advantage in this strategy. As we perceive from the figures in Table 1, the number of training games and the time needed to complete the training phase are significantly better than those of the rest of the strategies.


Regarding the Defense–Attack strategy, it is remarkable that its numbers are close to those of Attack–Defense in the Connect-4 game (we could argue that it is the second-best strategy), whereas it is one of the worst strategies for tic-tac-toe.

5. Conclusions

Several training strategies for a temporal difference learning agent, implemented as simulated opponent players with different rules, have been presented, described, created, and assessed in this study.

We have proven that if the right training method is used and a certain degree of hybridness is included in the model, the TD algorithm’s convergence may be substantially enhanced. When the Attack–Defense strategy is used, perfection is reached in the classic tic-tac-toe board game with fewer than 100,000 games of training. According to earlier research works, many millions of games were previously required to achieve comparable levels of play, when convergence was obtained at all.

Hybrid training strategies improve the efficiency of TD learning, reducing the resources needed to be successful (execution time, memory capacity) and ultimately being more cost-effective and power-aware.

Despite being particularly easy to construct and having a low computing cost, the introduced Attack–Defense training approach (try to win; if not feasible, try not to lose; else make a random move) has proven to be extremely successful in our testing, exceeding the other choices evaluated (even the Combined option). Furthermore, a TD agent that has been taught using this approach may be deemed a perfect tic-tac-toe player, having never lost a game against either random or expert opponents.

The Attack–Defense strategy presented in this work has also proved to outperform the rest of the strategies in the 4 × 4 Connect-4 board game (four rows and four columns), which has a considerably higher number of states than the tic-tac-toe game (an upper-bound estimation being 200,000 legal states).

Future research will look into the effectiveness of hybridness and the Attack–Defense training technique in board games with a larger state-space complexity (such as Connect-4 for larger board sizes, checkers, or go).

Two of the most popular applications of reinforcement learning are game playing and control problems in robotics. The underlying concepts behind the training technique presented in this research work (confronting the agent with an opponent that has partial knowledge of how to win and how not to lose, combined with a certain degree of randomness) could be generalized and applied to training robots in diverse situations, with different environments and degrees of freedom.

Author Contributions: Conceptualization, J.M.C., P.C.-J. and J.F.-C.; funding acquisition, J.M.C.; investigation, J.M.C., P.C.-J. and J.F.-C.; methodology, J.F.-C.; project administration, J.M.C.; resources, J.M.C.; software, J.F.-C.; supervision, J.M.C.; validation, J.F.-C.; visualization, P.C.-J. and J.F.-C.; writing—original draft, J.F.-C.; writing—review and editing, P.C.-J. and J.F.-C. All authors have read and agreed to the published version of the manuscript.

Funding: The research leading to these results has received funding from RoboCity2030-DIH-CM, Madrid Robotics Digital Innovation Hub, S2018/NMT-4331, funded by “Programas de Actividades I+D en la Comunidad de Madrid”, and cofunded by Structural Funds of the EU.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.

Conflicts of Interest: The authors declare no conflict of interest.

