Deep Q-learning to play tic-tac-toe


Q-learning is a reinforcement learning technique that can be used to teach an agent a policy, or tactic. As an example, a deep Q-learning algorithm was trained to play tic-tac-toe (Play TicTacToe). By playing 100,000 games against variations of itself, the Q-agent learned a tactic to play tic-tac-toe optimally (win or draw).

How the Q-agent learned this tactic is described in the following sections.

Q-learning


So how does the Q-agent learn the winning tactic? Before explaining the algorithm, let's start with some definitions. We have a Q-agent, which decides on the desired action (A). In this case the Q-agent is a deep neural network, but it could also be a table. The Q-agent chooses its action based on a state (S). After the chosen action, the algorithm gets feedback in the form of a reward (r). When playing tic-tac-toe, the state represents the board and the actions are the possible moves in the game.

An ideal Q-agent chooses its actions such that the total future reward is maximal. In other words, the action that the Q-agent selects results in the maximum total reward (\(R_t\)) over all following moves: $$R_t = r_t + r_{t+1} + ... + r_{n}$$ To distinguish between short and long term rewards, we introduce a discount factor (\(\gamma\)), raised to the power \(n-t\) for the reward \(n-t\) steps ahead. With this definition the total reward is equal to the current reward plus the discount factor times the total future reward: $$R_t = r_t + \gamma r_{t+1} + ... + \gamma^{n-t}r_{n} = r_t + \gamma R_{t+1}$$ A discount factor close to 1 means that the long term reward is important, whereas a discount factor close to 0 favours the short term reward. In the tic-tac-toe example the reward is +1 for a move without a consequence, +20 for a winning move, +10 for a move which leads to a draw and -20 for a move which leads to losing the game.
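As a concrete illustration of the discounted return, the sketch below folds a reward sequence from the back using \(R_t = r_t + \gamma R_{t+1}\). The reward values follow the scheme above; the discount factor and the episode itself are illustrative assumptions, not values taken from the trained agent.

```python
# Minimal sketch: discounted total reward for one hypothetical game,
# using the reward scheme above (+1 per neutral move, +20 for the winning move).
gamma = 0.9                  # illustrative discount factor
rewards = [1, 1, 1, 20]      # r_t, r_{t+1}, ..., r_n

def discounted_return(rewards, gamma):
    """Fold from the last reward backwards: R_t = r_t + gamma * R_{t+1}."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

print(discounted_return(rewards, gamma))  # 1 + 0.9*1 + 0.81*1 + 0.729*20 = 17.29
```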

So now we need a function that predicts the total future reward (\(R_t\)) given the current state and an action: the Q-function, $$R_t = Q(S_t,A)$$ The maximum of this function over all possible actions gives the maximum future reward that can be achieved, $$ R_t^* = \max_A{Q(S,A)}$$ From this function we can also derive the desired policy/tactic (\(\pi\)): the desired action is the action which results in the maximum value of the Q-function, $$ \pi(S) = \arg \max_A{Q(S,A)}$$ So if the tic-tac-toe board has the following configuration (it is o's turn)

x | - | -
o | x | -
o | - | -

the Q-function should give the total future reward for each possible action. In this case the total future reward should be maximal for the action corresponding to the bottom-right cell, because that move prevents x from completing the diagonal and gives o the opportunity to obtain a larger future reward.
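The sketch below shows how such a greedy choice \(\pi(S) = \arg\max_A Q(S,A)\) could look for this board. The flat board encoding and the random placeholder Q-values are assumptions for illustration only; in the real agent the Q-values come from the trained network.

```python
import numpy as np

# Greedy policy over the 9 board cells; occupied cells are masked out.
board = np.array(["x", "-", "-",
                  "o", "x", "-",
                  "o", "-", "-"])
q_values = np.random.rand(9)                 # placeholder for Q(S): one value per cell

valid = board == "-"                         # only empty cells are legal moves
masked = np.where(valid, q_values, -np.inf)  # rule out occupied cells
best_action = int(np.argmax(masked))         # index of the chosen cell (bottom-right is index 8)
```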

The only problem is how to obtain a Q-function which can predict the total future reward for all possible actions and states. To train the Q-agent, we need an algorithm to learn this Q-function. To do so we can express the Q-function in terms of the next state (\(S_{t+1}\)) and the next action (\(A_{t+1}\)), $$Q(S_t,A_t) = r_t + \gamma R_{t+1} = r_t + \gamma \max_{A_{t+1}}Q(S_{t+1},A_{t+1})$$ This equation is called the Bellman equation. We can use this equation to iteratively solve for the optimal Q-function. When we rewrite the Bellman equation as an update of the previous Q-function, we get $$Q^{new}(S_t,A_t) = Q(S_t,A_t) + \alpha[ r_t + \gamma \max_{A_{t+1}}Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)]$$ where \(\alpha\) is a learning rate, which controls the learning speed during each iteration.
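A tabular version of this update rule could look like the following sketch. The dictionary-based Q-table, the state encoding and the `done` flag are assumptions for illustration, not the exact implementation behind the game.

```python
from collections import defaultdict
import numpy as np

alpha, gamma = 0.1, 0.9                 # learning rate and discount factor (illustrative)
Q = defaultdict(lambda: np.zeros(9))    # Q[state] -> one value per action (9 cells)

def q_update(state, action, reward, next_state, done):
    """Q(S_t,A_t) += alpha * (r_t + gamma * max_A' Q(S_{t+1},A') - Q(S_t,A_t))."""
    target = reward if done else reward + gamma * np.max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])
```

Here `state` and `next_state` would be hashable encodings of the board, for example a tuple of the nine cell values.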

Deep neural network as Q-function


In traditional Q-learning methods a Q-table is used as the Q-function. In the current application a deep neural network is used instead. The difference between the traditional Q-function and the neural network is that the action is not needed as input: the neural network predicts the total future reward for every action (with \(m\) the number of actions) based on the current state alone, $$ \vec{R_t} =\begin{bmatrix} R_t(A_1)\\ \vdots \\R_t(A_m) \end{bmatrix}= Q(S_t)$$ Thus, the output of the Q-function is a vector with the total future reward for every action. To train the network, the total future reward given the next state is used (the next state is obtained after taking an action \(A_t^*\)), $$R_t^* = r_t + \gamma \max(Q(S_{t+1}))$$ The neural network is trained on the predicted total future rewards, where the total future reward based on the next state (\(R_t^*\)) is substituted for the entry of the selected action (\(A_t^*\)). So the network is trained with the current state (\(S_t\)) as input and the following vector as output, $$\vec{R_t} =\begin{bmatrix} R_t(A_1)\\ \vdots \\ R_t^*(A_t^*)\\ \vdots \\R_t(A_m) \end{bmatrix}$$
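Building one such training target could look like the sketch below. The `model` is assumed to be a Keras-style network with a 9-unit output (one predicted total future reward per cell); the function and variable names are assumptions, not the exact code behind the game.

```python
import numpy as np

gamma = 0.9  # illustrative discount factor

def training_target(model, state, action, reward, next_state, done):
    """Replace only the entry of the chosen action with R_t* = r_t + gamma * max Q(S_{t+1})."""
    target_vec = model.predict(state[np.newaxis])[0]   # current predictions R_t(A) for all actions
    best_future = 0.0 if done else np.max(model.predict(next_state[np.newaxis])[0])
    target_vec[action] = reward + gamma * best_future  # R_t* for the selected action A_t*
    return target_vec                                  # then fit: model.fit(state[np.newaxis], target_vec[np.newaxis])
```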

Exploration vs. Exploitation


The Q-agent always selects the action with the maximum predicted outcome from the Q-function. However, at the beginning of training the Q-function cannot yet give good predictions of the total future reward, so it is good if the Q-agent makes some random decisions at first. To introduce these random decisions, an epsilon parameter is added. When a random number between 0 and 1 is lower than epsilon, a random action is chosen instead of the prediction from the Q-function. So this parameter determines the chance that a random decision is made. As the model becomes better after every episode, the value of epsilon can become smaller after every episode. Thus after every episode epsilon is updated according to $$\epsilon \leftarrow \epsilon \, \epsilon_{decay}$$ Moreover, a minimum epsilon is also applied, so even a well-trained agent still makes the occasional random move.
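An epsilon-greedy action selection with decay could be sketched as follows; the epsilon values are illustrative assumptions, not the ones used during training.

```python
import random
import numpy as np

epsilon, epsilon_decay, epsilon_min = 1.0, 0.995, 0.05   # illustrative values

def choose_action(q_values, valid_actions):
    """With probability epsilon pick a random valid move, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(valid_actions)
    masked = np.full(len(q_values), -np.inf)             # mask out occupied cells
    masked[valid_actions] = q_values[valid_actions]
    return int(np.argmax(masked))

def decay_epsilon():
    """Shrink epsilon after every episode, but never below the minimum."""
    global epsilon
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
```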

Double Q-learning


Click here to play TicTacToe