Assignment 2

Artificial Intelligence in Games

In this assignment, you will implement a variety of reinforcement learning algorithms to find policies for the frozen lake environment. Please read this entire document before you start working on the assignment.

1 Environment

The code presented in this section uses the NumPy library. If you are not familiar with NumPy, please read the NumPy quickstart tutorial and the NumPy broadcasting tutorial.

The frozen lake environment has two main variants: the small frozen lake (Fig. 1) and the big frozen lake (Fig. 2). In both cases, each tile in a square grid corresponds to a state. There is also an additional absorbing state, which will be introduced soon. There are four types of tiles: start (grey), frozen lake (light blue), hole (dark blue), and goal (white). The agent has four actions, which correspond to moving one tile up, left, down, or right. However, with probability 0.1, the environment ignores the desired direction and the agent slips (moves one tile in a random direction, which may be the desired direction). An action that would cause the agent to move outside the grid leaves the state unchanged.

Figure 1: Small frozen lake

Figure 2: Big frozen lake

The agent receives reward 1 upon taking an action at the goal. In every other case, the agent receives zero reward. Note that the agent does not receive a reward upon moving into the goal (nor a negative reward upon moving into a hole). Upon taking an action at the goal or in a hole, the agent moves into the absorbing state. Every action taken at the absorbing state leads to the absorbing state, which also does not provide rewards. Assume a discount factor of γ = 0.9.

For the purposes of model-free reinforcement learning (or interactive testing), the agent is able to interact with the frozen lake for a number of time steps that is equal to the number of tiles.

Your first task is to implement the frozen lake environment. Using Python, try to mimic the interface presented in Listing 1.

The class EnvironmentModel represents a model of an environment. The constructor of this class receives a number of states, a number of actions, and a seed that controls the pseudorandom number generator. Its subclasses must implement two methods: p and r. The method p returns the probability of transitioning from state to next state given action. The method r returns the expected reward in having transitioned from state to next state given action. The method draw receives a pair of state and action and returns a state drawn according to p together with the corresponding expected reward. Note that states and actions are represented by integers starting at zero. We highly recommend that you follow the same convention, since this will facilitate immensely the implementation of reinforcement learning algorithms. You can use a Python dictionary (or equivalent data structure) to map (from and to) integers to a more convenient representation when necessary. Note that, in general, agents may receive rewards drawn probabilistically by an environment, which is not supported in this simplified implementation.

2 Tabular model-based reinforcement learning

Your next task is to implement policy evaluation, policy improvement, policy iteration, and value iteration. You may follow the interface suggested in Listing 2.

Listing 2: Tabular model-based algorithms.

def policy evaluation (env , policy , gamma, theta , max iterations ): value = np.zeros(env.n states , dtype=np.float)

The function policy evaluation receives an environment model, a deterministic policy, a discount factor, a tolerance parameter, and a maximum number of iterations. A deterministic policy may be represented by an array that contains the action prescribed for each state.

The function policy improvement receives an environment model, the value function for a policy to be improved, and a discount factor.

The function policy iteration receives an environment model, a discount factor, a tolerance parameter, a maximum number of iterations, and (optionally) the initial policy.

The function value iteration receives an environment model, a discount factor, a tolerance parameter, a maximum number of iterations, and (optionally) the initial value function.

3 Tabular model-free reinforcement learning

Your next task is to implement Sarsa control and Q-learning control. You may follow the interface suggested in Listing 3. We recommend that you use the small frozen lake to test your implementation, since these algorithms may need many episodes to find an optimal policy for the big frozen lake.

The function sarsa receives an environment, a maximum number of episodes, an initial learning rate, a discount factor, an initial exploration factor, and an (optional) seed that controls the pseudorandom number generator. Note that the learning rate and exploration factor decrease linearly as the number of episodes increases (for instance, eta[i] contains the learning rate for episode i).

The function q learning receives an environment, a maximum number of episodes, an initial learning rate, a discount factor, an initial exploration factor, and an (optional) seed that controls the pseudorandom number generator. Note that the learning rate and exploration factor decrease linearly as the number of episodes increases (for instance, eta[i] contains the learning rate for episode i).

Important: The ε-greedy policy based on Q should break ties randomly between actions that maximize Q for a given state. This plays a large role in encouraging exploration.

4 Non-tabular model-free reinforcement learning

In this task, you will treat the frozen lake environment as if it required linear action-value function approxi- mation. Your task is to implement Sarsa control and Q-learning control using linear function approximation. In the process, you will learn that tabular model-free reinforcement learning is a special case of non-tabular model-free reinforcement learning. You may follow the interface suggested in Listing 4.

The class LinearWrapper implements a wrapper that behaves similarly to an environment that is given to its constructor. However, the methods reset and step return a feature matrix when they would typically return a state s. The a-th row of this feature matrix contains the feature vector φ(s, a) that represents the pair of action and state (s, a). The method encode state is responsible for representing a state by such a feature matrix. More concretely, each possible pair of state and action is represented by a different vector where all elements except one are zero. Therefore, the feature matrix has |S||A| columns. The method decode policy receives a parameter vector θ obtained by a non-tabular reinforcement learning algorithm and returns the corresponding greedy policy together with its value function estimate.

The function linear sarsa receives an environment (wrapped by LinearWrapper), a maximum number of episodes, an initial learning rate, a discount factor, an initial exploration factor, and an (optional) seed that controls the pseudorandom number generator. Note that the learning rate and exploration factor decay linearly

as the number of episodes grows (for instance, eta[i] contains the learning rate for episode i).

The function linear q learning receives an environment (wrapped by LinearWrapper), a maximum number of episodes, an initial learning rate, a discount factor, an initial exploration factor, and an (optional) seed that controls the pseudorandom number generator. Note that the learning rate and exploration factor decay linearly

as the number of episodes grows (for instance, eta[i] contains the learning rate for episode i).

The Q-learning control algorithm for linear function approximation is presented in Algorithm 1. Note that this algorithm uses a slightly different convention for naming variables and omits some details for the sake of

simplicity (such as learning rate/exploration factor decay).

Algorithm 1 Q-learning control algorithm for linear function approximation

Input: feature vector φ(s, a) for all state-action pairs (s, a), number of episodes N , learning rate α, exploration

factor ε, discount factor γ Output: parameter vector θ

1: θ←0

2: for each i in {1,...,N} do

3: s ← initial state for episode i

4: for each action a: do

5: Q(a) ← ?i θiφ(s, a)i

6: end for

7: while state s is not terminal do

8: if with probability 1 − ε: then

a ← arg maxa Q(a)

10: else

11: a ← random action

12: end if

13: r ← observed reward for action a at state s

14: s′ ← observed next state for action a at state s

15: δ←r−Q(a)

16: for each action a′ : do

17: Q(a′) ← ?i θiφ(s′, a′)i

18: end for

19: δ ← δ + γ maxa′ Q(a′) {Note: δ is the temporal difference}

20: θ←θ+αδφ(s,a)

21: s←s′

22: end while

23: end for

Important: The ε-greedy policy based on Q should break ties randomly between actions that maximize Q (Algorithm 1, Line 9). This plays a large role in encouraging exploration.

5 Deep reinforcement learning

The code presented in this section uses the PyTorch library. If you are not familiar with PyTorch, please read the Learn the Basics tutorial.

In this task, you will implement the deep Q-network learning algorithm [Mnih et al., 2015] and treat the frozen lake environment as if it required non-linear action-value function approximation. In the process, you will learn how to train a reinforcement learning agent that receives images as inputs. You may follow the interface suggested in Listing 5.

The class FrozenLakeImageWrapper implements a wrapper that behaves similarly to a frozen lake environ- ment that must be given to its constructor. However, the methods reset and step return an image when they would typically return a state. This image is composed of four channels and is represented by a numpy.array of shape (4,h,w), where h is the number of rows and w is the number of columns of the lake grid. The first channel of this image is a h × w matrix whose elements are all zero except for the element that corresponds to the position of the agent, which has value one. The second channel of this image is a h × w matrix whose elements are all zero except for the element that corresponds to the start tile, which has value one. The third channel of this image is a h × w matrix whose elements are all zero except for the elements that correspond to hole tiles, which have value one. The fourth channel of this image is a h × w matrix whose elements are all zero except for the element that corresponds to the goal tile, which has value one. The method decode policy receives a neural network obtained by the deep Q-network learning algorithm and returns the corresponding greedy policy together with its value function estimate.

Finally, the function deep q network learning combines the previous classes to implement the deep Q-network learning algorithm.

6 Main function

Your final implementation task is to write a program that uses all the algorithms that you have implemented for this assignment. Your main function should behave analogously to the function presented in Listing 6. Using the small frozen lake as a benchmark, find and render optimal policies and values using policy iteration, value iteration, Sarsa control, Q-learning control, linear Sarsa control, linear Q-learning, and deep Q-network learning. For marking purposes, if your main function does not call one of these algorithms, we will assume that it is not implemented correctly.