One post tagged with "hello"

Tic-Tac-Toe AI using Deep Q-Learning

September 5, 2019 · 11 min read

Co-creator of Docusaurus 1

On December 5, 2017, DeepMind introduced AlphaZero to the world, a chess-playing program that defeated Stockfish simply by playing against itself for 4 hours. Impressed by AlphaZero, I wrote a simple AI to play Tic-Tac-Toe based on Deep Q-Learning, and today I want to share it with you.

In this article, I will cover Markov Decision Process, Q-Learning, Deep Q-Learning, and the application of Deep Q-learning to Tic-Tac-Toe.

1. What is a Markov Decision Process? The Problem with MDP

A Markov Decision Process (MDP) provides a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. For a problem to be reduced to an MDP, the states in that problem must satisfy the Markov property: each state depends only on the state preceding it.

The transition from state to state has a transition probability defined by a specific formula.

Describing the Tic-Tac-Toe problem as an MDP

Agent: An agent can be an entity that interacts with the environment by observing and deciding on actions based on those observations. In Tic-Tac-Toe, the agent is the player. The player's task is to win the match; using the symbols 'X' and 'O', 2 players take turns filling in empty cells, and the side that gets 3 in a row wins.

State: After each action taken by the agent, the environment returns a state for the agent to observe and determine the next action. In Tic-Tac-Toe, I represent the state as an array of length 9, corresponding to the 9 cells in the 3x3 board. Since the machine cannot understand the characters 'X' or 'O', we define empty cells as having a value of 0, cells marked by player X as 1, and O as -1.

Action: Each cell on the board corresponds to an action numbered from 0 to 8. For cells that have already been marked, that action is considered invalid and ignored.

Reward: For every action the agent takes, the environment returns a reward. We can view the reward as the environment's assessment of the agent's accuracy in taking that action; throughout the interaction with the environment, the agent's goal is to maximize the total reward. In Tic-Tac-Toe, the reward is +1 when the agent chooses an action that ends the game with a win, -1 for a loss, and 0 otherwise.

For the agent to select an action, we need an action-value function, denoted as . The action-value function indicates the total reward when choosing action at state . (gamma) is the discount factor; this value characterizes the importance of future rewards—the smaller the value, the less important they are.

Since the agent's goal is to maximize total reward, the remaining task is to find the action with the largest action-value.

Returning to Tic-Tac-Toe, suppose we currently have a state. The next turn belongs to player O. Passing each action in the state through the action-value function, we get specific values. Ignoring occupied cells, we see that the action at cell 4 has the highest value, so the agent will choose action 4. In reality, this move is good, so we can say the policy at this state is optimal.

Thus, we have modeled the problem as an MDP. But how do we create the action-value function? We can create the action-value function by solving the Bellman equation; however, with Tic-Tac-Toe, the number of states is too large for calculation, so the Q-learning algorithm will help us do this.

What is Q-learning?

Q-learning is model-free; by repeating actions and receiving rewards, Q-learning can create an action-value function. The action-value function in Q-learning is a look-up table called a Q-table, initialized with arbitrary values.

With each step in an episode, the agent selects an action using epsilon-greedy.

Epsilon-greedy is an algorithm for selecting actions based on an epsilon value (). First, the algorithm randomizes a number (). If , it selects a random action; otherwise, it chooses the greedy action with the highest value.
The epsilon value is used to balance exploration and exploitation. If the agent only chooses greedy actions, it might miss other actions with higher rewards. If it only chooses random actions, the Q-table will converge slowly.
In practice, one initializes epsilon as a large number, such as 0.99, and then gradually decreases this value to a smaller number, like 0.1.

Next, the action is sent to the environment, and the environment returns the next state and reward. The Q-table is updated based on the formula where is the learning rate ().

Deep Q-learning

The limitation of Q-learning is that if the state space is too large, it takes a lot of capacity to store the Q-table and slows down learning time because the Q-table is accessed continuously during training. Thus, Deep Q-learning (DQN) was born.

DQN inherits everything from Q-learning, but the Q-table is replaced by a deep neural network from deep learning (Deep neural network + Q-learning = Deep Q-learning). To calculate the action-value, we simply input the state, and the neural network immediately outputs the action-values.

However, to avoid overfitting, we use two neural networks here: a main network called the action-value network and an auxiliary network called the target action-value network.

Another issue with DQN is catastrophic forgetting. To reduce this problem, we use a "lifehack" called experience replay. For each step, we have a tuple (s, a, r, s', done) saved to memory, where done is a flag indicating if is a terminal state. We then sample a mini-batch and feed it into the neural network for training.

The update function also differs slightly from Q-learning. Let's look at the entire algorithm process.

Applying DQN to Tic-Tac-Toe

Now let's start coding.

Environment Setup

First, build the Tic-Tac-Toe environment.

import random

class Tictactoe_v0:
    def __init__(self):
        [cite_start]self.board = [0] * 9 # [cite: 346]
        self.wining_position = [[0, 1, 2], [3, 4, 5], [6, 7, 8],
                                [0, 3, 6], [1, 4, 7], [2, 5, 8],
                                [cite_start][0, 4, 8], [6, 4, 2]] # [cite: 347-368]
        [cite_start]self.current_turn = 1 # [cite: 369]
        [cite_start]self.player_mark = 1 # [cite: 370]

    def reset(self, is_human_first):
        [cite_start]self.board = [0] * 9 # [cite: 372]
        [cite_start]self.current_turn = 1 # [cite: 373]
        [cite_start]self.player_mark = 1 if is_human_first else -1 # [cite: 374]
        if not is_human_first:
            [cite_start]self.env_act() # [cite: 376]
        [cite_start]return self.board.copy() # [cite: 381]

    def check_win(self):
        for pst in self.wining_position:
            # Check for win condition
            if str(self.board[pst[0]]) + str(self.board[pst[1]]) + str(self.board[pst[2]]) in [str(self.current_turn) * 3]:
                 if self.current_turn == self.player_mark:
                     [cite_start]return 1, True # [cite: 386]
                 [cite_start]return -1, True # [cite: 387] # Note: OCR suggests 1, but context implies loss/opponent win
        if 0 not in self.board:
            [cite_start]return 0, True # [cite: 389]
        [cite_start]return 0, False # [cite: 390]

    def env_act(self):
        [cite_start]action = random.choice([i for i in range(len(self.board)) if self.board[i] == 0]) # [cite: 392]
        # Basic logic to block or win (simplified from OCR text)
        for pst in self.wining_position:
            com = str(self.board[pst[0]]) + str(self.board[pst[1]]) + str(self.board[pst[2]])
            if com.replace('0', '') == str(self.current_turn) * 2:
                 if self.board[pst[0]] == 0: action = pst[0]
                 elif self.board[pst[1]] == 0: action = pst[1]
                 else: action = pst[2]
        
        if self.board[action] != 0:
            [cite_start]raise Exception('Invalid action') # [cite: 406]
        [cite_start]self.board[action] = self.current_turn # [cite: 407]
        [cite_start]reward, done = self.check_win() # [cite: 408]
        [cite_start]self.current_turn = self.current_turn * -1 # [cite: 409-410]
        [cite_start]return reward, done # [cite: 411]

    def step(self, action):
        if self.board[action] != 0:
            [cite_start]raise Exception('Invalid action') # [cite: 414]
        [cite_start]self.board[action] = self.current_turn # [cite: 415]
        [cite_start]reward, done = self.check_win() # [cite: 416]
        [cite_start]self.current_turn = self.current_turn * -1 # [cite: 417-419]
        if done:
            [cite_start]return self.board.copy(), reward, done, None # [cite: 420]
        [cite_start]reward, done = self.env_act() # [cite: 421]
        [cite_start]return self.board.copy(), reward, done, None # [cite: 422]

Implement Epsilon-Greedy

import numpy as np

class EpsilonGreedy:
    def __init__(self, epsilon):
        [cite_start]self.epsilon = epsilon # [cite: 426]

    def perform(self, q_value, action_space: list = None):
        [cite_start]prob = np.random.sample() # get probability of taking random action [cite: 432-433]
        [cite_start]if prob < self.epsilon: # take random action [cite: 434]
            if action_space is None: # all actions are available
                [cite_start]return np.random.randint(len(q_value)) # [cite: 436]
            [cite_start]return np.random.choice(action_space) # [cite: 437]
        [cite_start]else: # take greedy action [cite: 438]
            if action_space is None:
                [cite_start]return np.argmax(q_value) # [cite: 440]
            [cite_start]return max([[q_value[a], a] for a in action_space], key=lambda x: x[0])[1] # [cite: 441]

    def decay(self, decay_value, lower_bound):
        # Adjust the epsilon value by the formula: epsilon = max(decayValue * epsilon, lowerBound)
        [cite_start]self.epsilon = max(self.epsilon * decay_value, lower_bound) # [cite: 447]

Implement Experience Replay

class ExperienceReplay:
    def __init__(self, e_max: int):
        if e_max <= 0:
            [cite_start]raise ValueError('Invalid value for memory size') # [cite: 451-452]
        self.e_max = e_max
        self.memory = list()
        [cite_start]self.index = 0 # [cite: 455]

    def add_experience(self, sample: list):
        if len(sample) != 5:
            [cite_start]raise Exception('Invalid sample') # [cite: 457-458]
        if len(self.memory) < self.e_max:
             [cite_start]self.memory.append(sample) # [cite: 462]
        else:
             [cite_start]self.memory[self.index] = sample # [cite: 463]
        [cite_start]self.index = (self.index + 1) % self.e_max # [cite: 464]

    def sample_experience(self, sample_size: int, cer_mode: bool):
        [cite_start]samples = random.sample(self.memory, sample_size) # [cite: 466]
        if cer_mode:
            [cite_start]samples[-1] = self.memory[self.index - 1] # [cite: 467]
        
        # state_samples, action_samples, reward_samples, next_state_samples, done_samples
        s_batch, a_batch, r_batch, ns_batch, done_batch = map(np.array, zip(*samples))
        [cite_start]return s_batch, a_batch, r_batch, ns_batch, done_batch # [cite: 468]

    def get_size(self):
        [cite_start]return len(self.memory) # [cite: 474]

Implement DQN

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers, losses

class DQN: # Inherits BaseModel in original code
    def __init__(self, discount_factor: float, epsilon: float, e_min: int, e_max: int):
        # super().__init__(discount_factor, epsilon, e_min, e_max)
        [cite_start]self.epsilon_greedy = EpsilonGreedy(epsilon) # [cite: 480]
        [cite_start]self.gamma = discount_factor # [cite: 481]
        [cite_start]self.e_min = e_min # [cite: 482]
        [cite_start]self.exp_replay = ExperienceReplay(e_max) # [cite: 483]
        [cite_start]self.training_network = Sequential() # [cite: 484-485]
        [cite_start]self.target_network = Sequential() # [cite: 486-487]
        [cite_start]self.cache = list() # [cite: 488]

    def observe(self, state, action_space: list = None):
        [cite_start]q_value = self.training_network.predict(np.array([state])).ravel() # [cite: 490]
        if action_space is not None:
            [cite_start]return max([[q_value[a], a] for a in action_space], key=lambda x: x[0])[1] # [cite: 492]
        return np.argmax(q_value)

    def observe_on_training(self, state, action_space: list = None) -> int:
        q_value = self.training_network.predict(np.array([state])).ravel()
        [cite_start]action = self.epsilon_greedy.perform(q_value, action_space) # [cite: 493]
        [cite_start]self.cache.extend([state, action]) # [cite: 494]
        [cite_start]return action # [cite: 495]

    def take_reward(self, reward, next_state, done):
        self.cache.extend([reward, next_state, done])
        [cite_start]self.exp_replay.add_experience(self.cache.copy()) # [cite: 497]
        [cite_start]self.cache.clear() # [cite: 498]

    def train_network(self, sample_size: int, batch_size: int, epochs: int, verbose: int = 2):
        if self.exp_replay.get_size() >= self.e_min:
            # state_samples, action_samples, reward_samples, next_state_samples, done_samples
            s_batch, a_batch, r_batch, ns_batch, done_batch = self.exp_replay.sample_experience(sample_size, False)
            states, q_values = self.replay(s_batch, a_batch, r_batch, ns_batch, done_batch)
            history = self.training_network.fit(states, q_values, epochs=epochs, batch_size=batch_size, verbose=verbose)
            [cite_start]return history.history['loss'] # [cite: 500]

    def replay(self, states, actions, rewards, next_states, terminals):
        [cite_start]q_values = self.target_network.predict(np.array(states)) # get q value at state t [cite: 502]
        nq_values = self.target_network.predict(np.array(next_states)) # get q value at state t+1
        
        for i in range(len(states)):
            a = actions[i]
            [cite_start]done = terminals[i] # [cite: 503-504]
            [cite_start]r = rewards[i] # [cite: 508]
            if done:
                 [cite_start]q_values[i][a] = r # [cite: 511]
            else:
                 [cite_start]q_values[i][a] = r + self.gamma * np.max(nq_values[i]) # [cite: 513-514]
        [cite_start]return states, q_values # [cite: 515]

    def update_target_network(self):
        [cite_start]self.target_network.set_weights(self.training_network.get_weights()) # [cite: 517]

Building the DQN model

I use Keras to build the neural network.

[cite_start]env = Tictactoe_v0() # [cite: 519]
[cite_start]agent = DQN(0.7, 1, 4096, 1048576) # [cite: 520]
[cite_start]op1 = optimizers.RMSprop(learning_rate=0.00025) # [cite: 521]

[cite_start]agent.training_network.add(Dense(128, activation='relu', input_shape=(9,))) # [cite: 522]
[cite_start]agent.training_network.add(Dense(128, activation='relu')) # [cite: 523]
[cite_start]agent.training_network.add(Dense(9, activation='linear')) # [cite: 524]
[cite_start]agent.training_network.compile(optimizer=op1, loss=losses.mean_squared_error) # [cite: 525]

[cite_start]op2 = optimizers.RMSprop(learning_rate=0.00025) # [cite: 526]
[cite_start]agent.target_network.add(Dense(128, activation='relu', input_shape=(9,))) # [cite: 527]
[cite_start]agent.target_network.add(Dense(128, activation='relu')) # [cite: 528]
[cite_start]agent.target_network.add(Dense(9, activation='linear')) # [cite: 529]
[cite_start]agent.target_network.compile(optimizer=op2, loss=losses.mean_squared_error) # [cite: 530]

agent.update_target_network()

Here, I only trained the AI to play first.

Evaluation

After the training process, the model is evaluated. I had the agent play against a random player and a minimax player. Each round consisted of 1000 matches, and the first move was chosen randomly. Here are the results.

VS MINIMAX PLAYER

Training times	Wins	Draws	Loses
50000	0	1000	0

VS RANDOM PLAYER

Training times	Wins	Draws	Loses
50000	809	190	1

Conclusion

When playing against the Minimax player, the AI plays perfectly. However, when the opponent is a random player, the AI makes mistakes. I trained for another 100,000 episodes, but the results were not more optimistic, so I stopped.

You can view my entire code here.

References

introduction-deep-q-learning-python
Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem
Reinforcement Learning: An Introduction
Human Level Control Through Deep Reinforcement Learning

1. What is a Markov Decision Process? The Problem with MDP​

Environment Setup​

Building the DQN model​

Evaluation​

Conclusion​

1. What is a Markov Decision Process? The Problem with MDP

Environment Setup

Building the DQN model

Evaluation

Conclusion