Reinforcement Learning Final Project
2023-05-11
1 Introduction
The goal of the final project is to implement two kinds of model-free RL methods:
value-based RL and policy-based RL. Within this scope, you are free to choose which
RL methods to use to solve the two benchmark environments.
2 Review
2.1 Value-Based Reinforcement Learning
Value-based methods strive to fit the action-value function or the state-value function,
e.g. Monte Carlo and TD learning for model-free policy evaluation, and SARSA and Q-learning
for model-free control. Off-policy training is easy to implement in
value-based methods, and DQN achieves remarkable performance in the off-policy setting.
In DQN (Mnih et al., 2015), past experiences stored in the experience replay buffer can
be used to train the deep Q-network. In many transfer algorithms for DQN,
an expert's experiences are often used to fit the current value function. Hence
value-based methods are often more sample efficient.
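To make the off-policy, replay-based training described above concrete, below is a minimal sketch of a DQN-style update step. It is only an illustration, not a recommended setup: the network architecture, buffer size, and hyperparameters are assumptions, PyTorch is assumed as the framework, and a small vector observation is used for brevity (an Atari agent would use a convolutional network and frame preprocessing).

# Minimal sketch of a DQN-style update with experience replay
# (illustrative sizes and hyperparameters; assumes PyTorch and a
# discrete-action environment with vector observations).
import random
from collections import deque

import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99            # assumed sizes for illustration

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())    # periodically re-sync during training
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=100_000)             # stores (s, a, r, s_next, done) tuples

def dqn_update(batch_size=32):
    # sample a mini-batch of past transitions from the replay buffer
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s_next, done = map(torch.tensor, zip(*batch))
    s, s_next = s.float(), s_next.float()
    # Q(s, a) for the actions actually taken
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                          # bootstrapped off-policy target
        target = r.float() + gamma * (1 - done.float()) * target_net(s_next).max(1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()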
Although value-based RL methods like DQN and its variants achieve remarkable
performance on some tasks, e.g. Atari games, their inherent drawbacks hinder their
development.
First, action selection in value-based methods is based on the action
values, which is inherently unsuited to continuous action spaces.
Second, non-linear value function approximation, such as a neural network, is
unstable and brittle with respect to its hyperparameters.
2.2 Policy-Based Reinforcement Learning
In the original policy gradient $\nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$, the return $v_t$ is an unbiased estimate
of the expected long-term value $Q^{\pi}(s,a)$ under the policy $\pi_\theta(s)$ (the Actor).
However, the original policy gradient suffers from high variance. The Actor-Critic
algorithm uses a Q-value function $Q_w(s,a)$, named the Critic, to estimate $Q^{\pi}(s,a)$.
Though the Critic may introduce bias, it can dramatically reduce variance, and a
proper choice of function approximator may avoid the bias.
The biggest drawback of policy gradient methods is sample inefficiency, since
policy gradients are estimated from rollouts. Although actor-critic methods use
value approximators (the Critic) instead of full rollouts, their on-policy style remains
sample inefficient. Prior works, such as DDPG (Lillicrap, 2015) and Soft Actor-Critic
(Haarnoja, 2018), strive to introduce an off-policy mode to Actor-Critic.
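To make the Actor-Critic idea above concrete, below is a minimal sketch of a one-step actor-critic update for a discrete-action policy. As a simplification of the Q-value Critic $Q_w(s,a)$ described above, it uses a state-value Critic with the TD error as the advantage; the network shapes and learning rate are illustrative assumptions, and PyTorch is assumed.

# Minimal sketch of a one-step actor-critic update (illustrative sizes and
# learning rate; assumes PyTorch and a discrete action space).
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99            # assumed sizes for illustration

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def actor_critic_update(s, a, r, s_next, done):
    """s, s_next: float tensors of shape (obs_dim,); a: int; r: float; done: bool."""
    v_s = critic(s).squeeze()                     # V(s), the Critic
    with torch.no_grad():
        v_s_next = critic(s_next).squeeze()
        td_target = r + gamma * (1.0 - float(done)) * v_s_next
    advantage = td_target - v_s                   # TD error used as the advantage estimate
    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(torch.tensor(a))
    actor_loss = -log_prob * advantage.detach()   # policy gradient: -log pi(a|s) * advantage
    critic_loss = advantage.pow(2)                # fit the Critic to the TD target
    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()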
3 Experiment Environments and Requirements
OpenAI provides the benchmark environment toolkit `gym' to facilitate the
development of reinforcement learning. Eight types of experiment environments are
available (see https://gym.openai.com/envs/#atari for more). In our
project, you are required to train agents on Atari and MuJoCo. You should
choose appropriate and effective RL methods to achieve scores as high as you can
in the environments. To get started with gym, refer to
https://github.com/openai/gym; a minimal interaction loop is also sketched at the
end of Section 3.1.
3.1 Atari Games Environment Description
The Atari 2600 is a home video game console developed in 1977. Dozens of its
games are provided by `gym'. In our project, we limit the choice of environment
to the following:
VideoPinball-ramNoFrameskip-v4
BreakoutNoFrameskip-v4
PongNoFrameskip-v4
BoxingNoFrameskip-v4
You should choose at least one of these environments to test your value-based method.
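As a starting point with gym, the following is a minimal sketch of a random-action interaction loop on one of the Atari environments listed above. It assumes the classic gym API (reset() returning an observation and step() returning a 4-tuple), which differs slightly in newer gym/gymnasium releases, and the episode count is an illustrative choice; the random policy is only a placeholder for your agent.

# Minimal random-agent interaction loop with a gym Atari environment
# (classic gym API assumed; newer gym/gymnasium versions return extra values).
import gym

env = gym.make("PongNoFrameskip-v4")              # any of the listed Atari environments
for episode in range(3):                          # illustrative episode count
    obs = env.reset()
    done, episode_return = False, 0.0
    while not done:
        action = env.action_space.sample()        # replace with your agent's policy
        obs, reward, done, info = env.step(action)
        episode_return += reward
    print(f"episode {episode}: return = {episode_return}")
env.close()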
3.2 MuJoCo Continuous Control Environment Description
MuJoCo stands for Multi-Joint dynamics with Contact; it was originally designed
for testing model-based control methods. Now, the MuJoCo simulator is a
commonly adopted benchmark for continuous control. We narrow down the
choice of environments to the following:
Hopper-v2
Humanoid-v2
HalfCheetah-v2
Ant-v2
You should choose at least one of these environments to test your policy-based method.
3.3 Requirements
Here is the experiment content:
You are required to choose and implement value-based RL algorithms and
test them on at least one of the Atari games listed above.
You are required to choose and implement policy-based RL algorithms
and test them on at least one of the MuJoCo environments listed above.
The choice of algorithms within the scope of value-based and policy-based methods
is not limited. For ease of running your submitted code and grading, we impose
a few requirements on this project.
Programming language: Python 3
The final results should be produced by running your code with the environment name, like the following:
python run.py --env_name BreakoutNoFrameskip-v4
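A minimal sketch of how run.py could parse this argument and dispatch to the chosen environment is shown below; the flag name --env_name follows the command above, while the default environment and the printed fields are illustrative assumptions.

# Minimal sketch of argument parsing for run.py (flag name taken from the
# command above; default environment is an illustrative choice).
import argparse

import gym

def main():
    parser = argparse.ArgumentParser(description="RL final project entry point")
    parser.add_argument("--env_name", type=str, default="BreakoutNoFrameskip-v4",
                        help="gym environment id to train on")
    args = parser.parse_args()

    env = gym.make(args.env_name)
    print(f"Training on {args.env_name}: "
          f"observation space {env.observation_space}, action space {env.action_space}")
    # ... construct your agent and start training here ...

if __name__ == "__main__":
    main()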
4 Report and Submission
4.1 About Submission
You are required to accomplish this project individually.
Your report and source code should be zipped and named after "Name StuID".
In addition, a README file with instructions for running your code should
be included inside the zip file.
The submission deadline is June 8, 2023.
4.2 About Report
The report should cover, but is not limited to, the following sections:
A description of the algorithms you use.
The performance the algorithms achieve in the selected environments.
An analysis of the algorithms.
4.3 Bonus
Modification of the algorithms that achieves better performance.
Testing your algorithms on more than one environment.
Excellent analysis of the algorithms.