
# COMP9414 Artificial Intelligence Assignment 1 - Reward-based learning agents


COMP9414 23T2 Artificial Intelligence

# Assignment 1 - Reward-based learning

## 1 Activities

In this assignment, you are asked to implement a modified version of the temporal-difference methods Q-learning and SARSA. Additionally, you are asked to implement a modified version of the action selection methods softmax and ϵ-greedy. To run your experiments and test your code you should use the example gridworld from Tutorial 3 (see Fig. 1). The modification of the methods includes the following two aspects:

- Random numbers will be obtained sequentially from a file.
- The initial Q-values will be obtained from a file as well.

The random numbers are available in the file `random_numbers.txt`. The file contains 100,000 random numbers between 0 and 1, created with seed = 9999 using `numpy.random.random` as follows:

```python
import numpy as np
np.random.seed(9999)
random_numbers = np.random.random(100000)
np.savetxt("random_numbers.txt", random_numbers)
```

### 1.1 Implementing modified SARSA and ϵ-greedy

For the modified SARSA you must use the code reviewed during Tutorial 3 as a base. Consider the following:

- The method will use a given set of initial Q-values; i.e., instead of initialising them with random values, the initial Q-values should be obtained from the file `initial_Qvalues.txt`. You must load the values using `np.loadtxt("initial_Qvalues.txt")`.
- The initial state for the agent before training will always be 0.

For the modified ϵ-greedy, create an action selection method that receives the state as an argument and returns the action. Consider the following:

- The method must use one random number from the provided file sequentially each time; i.e., a random number is used only once.
- If the random number rnd <= ϵ, the method returns an exploratory action. The next random number is used to decide which action to return, as shown in Table 1.
- You should keep a counter for the random numbers, as you will need it to access the numbers sequentially; i.e., you should increase the counter every time a random number is used.

| Random number (r) | Action | Action code |
|---|---|---|
| r <= 0.25 | down | 0 |
| 0.25 < r <= 0.5 | up | 1 |
| 0.5 < r <= 0.75 | right | 2 |
| 0.75 < r <= 1 | left | 3 |

Table 1: Exploratory action selection given the random number.
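Under these rules, the selection step and the SARSA update could be sketched as follows. This is only an illustration: the function names and the greedy tie-breaking (lowest index wins via `np.argmax`) are assumptions, and the random numbers are regenerated with the given seed so the snippet runs without the provided file:

```python
import numpy as np

# Stand-in for random_numbers.txt, regenerated with the given seed.
np.random.seed(9999)
random_numbers = np.random.random(100000)
counter = 0

def select_action_egreedy(Q, state, epsilon=0.25):
    """Modified epsilon-greedy: draws random numbers sequentially."""
    global counter
    rnd = random_numbers[counter]
    counter += 1
    if rnd <= epsilon:                   # exploratory action
        r = random_numbers[counter]
        counter += 1
        if r <= 0.25:                    # thresholds from Table 1
            return 0                     # down
        elif r <= 0.5:
            return 1                     # up
        elif r <= 0.75:
            return 2                     # right
        else:
            return 3                     # left
    return int(np.argmax(Q[state]))      # greedy action (ties -> lowest index)

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.7, gamma=0.4):
    """On-policy update: bootstraps from the action actually chosen next."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```

Note that an exploratory step consumes two random numbers (one for the ϵ test, one for Table 1) while a greedy step consumes one, which is why the shared counter matters.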

### 1.2 Implementing Q-learning and softmax

You should implement the temporal-difference method Q-learning. Consider the following for the implementation:

- For Q-learning, the same set of initial Q-values will be used (provided in the file `initial_Qvalues.txt`).
- Update the Q-values according to the method. Remember this is an off-policy method.
- As in the previous case, the initial state before training is also 0.

For the softmax action selection method, consider the following:

- Use a temperature parameter τ = 0.1.
- Use a random number from the provided file to compare against the cumulative probabilities to select an action. Hint: `np.searchsorted` returns the position where a number should be inserted in a sorted array to keep it sorted; this is equivalent to the action selected by softmax.
- Remember to use and increase a counter every time you use a random number.

### 1.3 Testing and plotting the results

You should plot a heatmap with the final Q-values after 1,000 learning episodes. Additionally, you should plot the accumulated reward per episode and the number of steps taken by the agent in each episode. For instance, if you want to test your code, you can use the gridworld shown in Fig. 1; you will obtain the rewards shown in Fig. 2 and the steps shown in Fig. 3. The learning parameters used are: learning rate α = 0.7, discount factor γ = 0.4, ϵ = 0.25, and τ = 0.1.

[Figure 2: Accumulated rewards. (a) Q-learning + ϵ-greedy; (b) Q-learning + softmax; (c) SARSA + ϵ-greedy; (d) SARSA + softmax.]

In case you want to compare your results with the exact output for this example using `diff`, four files with the accumulated reward and four files with the steps per episode are provided (the combinations of Q-learning/SARSA with ϵ-greedy/softmax). To mark your submission, different gridworlds and learning parameters will be used as test cases.
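For the `diff` comparison to work byte-for-byte, your per-episode arrays should be written the same way the reference files were presumably produced, i.e., with `np.savetxt`. A minimal sketch with placeholder data and placeholder file names (the provided files' actual names are not given here):

```python
import numpy as np

# Placeholder per-episode results; in practice these come from your
# training loop over the 1,000 episodes.
accumulated_rewards = np.array([-1.0, -0.5, 0.2, 0.8])
steps_per_episode = np.array([18, 12, 9, 7])

# Output file names are illustrative, not the names of the provided files.
np.savetxt("my_rewards.txt", accumulated_rewards)
np.savetxt("my_steps.txt", steps_per_episode)

# Then compare on the command line, e.g.:
#   diff my_rewards.txt <provided reward file>
```

`np.savetxt` writes one value per line in a fixed exponent format, so a clean `diff` means your values match the reference exactly, not just approximately.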

[Figure 3: Steps per episode. (a) Q-learning + ϵ-greedy; (b) Q-learning + softmax; (c) SARSA + ϵ-greedy; (d) SARSA + softmax.]
