Q-agent
Q-learning is a reinforcement learning technique that can be used to teach an agent a policy, or tactic. As an example, a deep Q-learning algorithm is trained to play tic-tac-toe (Play TicTacToe). By playing 100,000 games against variations of itself, the Q-agent learned a tactic to play tic-tac-toe optimally (win or draw).
How the Q-agent learned this tactic is described in the following sections.
So how does the Q-agent learn the winning tactic? Before explaining the algorithm, let's start with some definitions. We have a Q-agent, which decides on the desired action (A). In this case the Q-agent is a deep neural network, but it could also be a table. The Q-agent makes this decision based on a state (S). Based on the chosen action, the algorithm receives feedback in the form of a reward (r). When playing tic-tac-toe, the board is represented by the state and the actions are the possible moves in the game.
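To make the state, action and reward concrete, here is a minimal sketch in Python (an assumed encoding, not necessarily the one used by the trained agent): the state is a length-9 list in row-major order with +1 for the Q-agent's marks, -1 for the opponent and 0 for an empty cell, and an action is the index of the cell to play.

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def possible_actions(state):
    """All legal actions: the indices of the empty cells."""
    return [i for i, cell in enumerate(state) if cell == 0]

def reward(state, player=1):
    """Reward scheme from the text: +20 win, +10 draw, -20 loss, +1 otherwise."""
    for a, b, c in WIN_LINES:
        line = (state[a], state[b], state[c])
        if line == (player, player, player):
            return 20
        if line == (-player, -player, -player):
            return -20
    return 10 if not possible_actions(state) else 1
```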
An ideal Q-agent decides in such a way that the total future reward is maximal. In other words, the action that the Q-agent selects results in the maximum total reward (\(R_t\)) over all remaining moves: $$R_t = r_t + r_{t+1} + ... + r_{n}$$ To distinguish between short- and long-term rewards, we introduce a discount factor (\(\gamma\)): each future reward is multiplied by the discount factor raised to the power of the number of steps ahead, \(n-t\). With this definition, the total reward is also equal to the current reward plus the discount factor times the total future reward: $$R_t = r_t + \gamma r_{t+1} + ... + \gamma^{n-t}r_{n} = r_t + \gamma R_{t+1}$$ A discount factor close to 1 means that the long-term reward is important, whereas a discount factor close to 0 favours the short-term reward. In the tic-tac-toe example the reward is +1 for a move without consequence, +20 for a winning move, +10 for a move which leads to a draw and -20 for a move which leads to losing the game.
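As a small illustration of the formula above, the snippet below computes the discounted return from a list of rewards using the recursion \(R_t = r_t + \gamma R_{t+1}\); the reward sequence is just an example.

```python
def discounted_return(rewards, gamma=0.9):
    """Total discounted reward R_t for rewards [r_t, ..., r_n],
    computed backwards via R_t = r_t + gamma * R_{t+1}."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Three neutral moves (+1 each) followed by a winning move (+20):
# 1 + 0.9*1 + 0.9**2 * 1 + 0.9**3 * 20 = 17.29
print(discounted_return([1, 1, 1, 20], gamma=0.9))
```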
So now we need a function which can predict the total future reward (\(R_t\)) given an action and the current state: the Q-function,
$$R_t = Q(S_t,A)$$
Based on the current state and an action, this function returns the total future reward. The maximum of this function over all possible actions therefore gives the maximum future reward that can be achieved,
$$ R_t^* = \max_A{Q(S,A)}$$
Based on this function, we can also select the desired policy/tactic (\(\pi\)). The desired action is the action which results in the maximum value of the Q-function,
$$ \pi(S) = \arg \max_A{Q(S,A)}$$
So if the tic-tac-toe board has the following configuration (it is o's turn)
x | - | -
o | x | -
o | - | -
the Q-function should give the total future reward for each possible action. In this case the total future reward should be maximal for the action corresponding to the bottom-right move, because that move prevents x from winning and gives the opportunity to obtain a larger future reward.
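A sketch of how this greedy action selection could look in code, assuming a hypothetical q_values(state) function (for example, the neural network) that returns one Q-value per cell; illegal moves are masked out before taking the argmax.

```python
def choose_action(state, q_values):
    """Greedy policy pi(S) = argmax_A Q(S, A), restricted to empty cells."""
    legal = [i for i, cell in enumerate(state) if cell == 0]
    scores = q_values(state)                    # one Q-value per cell
    return max(legal, key=lambda a: scores[a])  # argmax over the legal actions

# The board from the example, from o's perspective (+1 = o, -1 = x, 0 = empty):
state = [-1, 0, 0,
          1, -1, 0,
          1, 0, 0]
# A well-trained Q-function should assign the highest value to cell 8
# (bottom-right), the move that blocks x's diagonal.
```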
The only problem is how to obtain a Q-function which can predict the total future reward for all possible actions and states. To train the Q-agent, we need an algorithm to learn this Q-function. To do so we can express the Q-function in terms of the next state (\(S_{t+1}\)) and the next action (\(A_{t+1}\)), $$Q(S_t,A_t) = r_t + \gamma R_{t+1} = r_t + \gamma \max_{A_{t+1}}Q(S_{t+1},A_{t+1})$$ This equation is called the Bellman equation. We can use it to iteratively solve for the optimal Q-function: rewriting the Bellman equation as an update of the previous Q-function gives $$Q^{new}(S_t,A_t) = Q(S_t,A_t) + \alpha[ r_t + \gamma \max_{A_{t+1}}Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)]$$ where \(\alpha\) is a learning rate, which controls the learning speed during each iteration.
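To illustrate this update rule, here is a minimal tabular sketch (the agent in this article is a deep neural network, but the form of the update is the same; the names Q and q_update are just for this example).

```python
from collections import defaultdict

# Q[(state, action)] -> estimated total future reward;
# states must be hashable, e.g. tuples of the 9 cells.
Q = defaultdict(float)

def q_update(state, action, r, next_state, next_actions, alpha=0.1, gamma=0.9):
    """One Bellman update:
    Q(S_t,A_t) += alpha * (r_t + gamma * max_a Q(S_{t+1},a) - Q(S_t,A_t)).
    For a terminal state, pass next_actions=[] so the bootstrap term is 0."""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    td_target = r + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```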