RL 2022/2023 Coursework

1 Introduction

The goal of this coursework is to implement different reinforcement learning algorithms covered in the lectures. By completing this coursework, you will get first-hand experience of how different algorithms perform in different decision-making problems.

Throughout this coursework, we refer to the lecture slides to aid your understanding and give page numbers where you can find more information in the RL textbook (“Reinforcement Learning: An Introduction (2nd edition)” by Sutton and Barto).

As stated in the course prerequisites, we do expect students to have a good understanding of Python programming, and of course any material covered in the lectures is the core foundation to work on this coursework. Many tutorials on Python can be found online.

We encourage you to start the coursework as early as possible to have sufficient time to ask any questions.

2 Contact

Piazza: Please post questions about the coursework in the Piazza forum so that everyone can view the answers in case they have similar questions. We provide different tags/folders in Piazza for each question in this coursework. Please post your questions using the appropriate tag so that others can easily read through all the posts regarding a specific question.

Lab sessions: There will also be lab sessions in person, during which you can ask questions about the coursework. We highly recommend attending these sessions, especially if you have questions about PyTorch and the code base we use. The lab sessions schedule can be accessed at this link.

Note: Please keep in mind that Piazza questions and lab sessions are public for discussions. Given that this coursework is individual work and graded, please do not disclose or discuss any information which could be considered a hint towards or part of the solution to any of the questions. However, you can, and we encourage you to, ask questions about instructions that are unclear to you, as well as general questions about algorithms (disconnected from their implementation) and concepts. Before posting, always ask yourself whether your question in itself discloses implementation details or might provoke answers disclosing such information.

We understand that Piazza is a very valuable place to discuss many matters on this course between students and teaching staff, but also between students. Particularly at these times, where exchange among students is severely limited due to (mostly) remote teaching, Piazza can be one of the few places where such exchange can take place. We are committed to making this exchange as simple and effective as possible and hope you keep these boundaries in mind when asking questions about the coursework.

3 Getting Started

To get you started, we provide a repository of code to build upon. Each question specifies which sections of algorithms you are expected to implement and will point you to the respective files.

The code base is fully written in Python and we expect you to use several standard machine learning packages to write your solutions with. Therefore, start by downloading Python to your local machine. We recommend you use at least Python version 3.8.

Python can be installed using the official installers, or alternatively using the respective package manager on Linux or Homebrew on macOS.

After installing Python, we highly recommend creating a virtual environment to install the required packages (below we provide instructions for Python’s built-in venv module; another common alternative is conda). This allows you to neatly organise the required packages for different projects and avoid potential issues caused by insufficient access permissions on your machines.

On Linux or macOS machines, type the following command in your terminal:

python3 -m venv <environment name>

You should now see a new folder with the same name as the environment name you provided in the previous command. In your current directory, you can then execute the following command to activate your virtual environment on Linux or macOS machines:

source <environment name>/bin/activate

If you are using Windows, please refer to the official Python guide for detailed instructions.

Finally, execute the following command to download the code base:

git clone https://github.com/uoe-agents/uoe-rl2023-coursework.git

Navigate to <Coursework directory with setup> and execute the following command to install the code base and the required dependencies:

pip3 install -e .

Note that you may encounter problems during the installation of the above packages on macOS Ventura. If that happens, please try updating your macOS to Ventura 13.1 and Xcode to 14.2.

For detailed instructions on Python’s library manager pip and virtual environments, see the official Python guide and this guide to Python’s virtual environments.

4 Overview

The coursework contains a total of 100 marks and counts towards 50% of the course grade.

Below you can find an overview of the coursework questions and their respective marks. More details on required algorithms, environments and required tasks can be found in Section 5. Submissions will be marked based on correctness and performance as specified for each question. In Questions 2, 3 and 5, some marks are given based on a short write-up or an answer to a multiple-choice question. When relevant, you will be instructed to provide these answers as the output of a dedicated function in the answer script located at the root of the rl2023 directory (refer to Figure 6 for a breakdown of the folder structure). Details on marking can be found in Section 6 and Section 7 presents instructions on how to submit the required assignment files.

Question 1 – Dynamic Programming [15 Marks]

Value Iteration [7.5 Marks]

Policy Iteration [7.5 Marks]

Question 2 – Tabular Reinforcement Learning [20 Marks]

Q-Learning [7 Marks]

On-policy first-visit Monte Carlo [7 Marks]

Question 3 – Deep Reinforcement Learning [32 Marks]

Deep Q-Networks [6 Marks]


Implement ϵ-scheduling strategies [4 Marks]

Select best hyperparameter profiles [2 Marks]

Answer questions on ϵ-scheduling [4 Marks]

Question 4 – Continuous Deep Reinforcement Learning [18 Marks]

Question 5 – Fine-tuning the Algorithms [15 Marks]

5 Questions

Question 1 – Dynamic Programming [15 Marks]


The aim of this question is to provide you with a better understanding of dynamic programming approaches to find optimal policies for Markov Decision Processes (MDPs). Specifically, you are required to implement the Policy Iteration (PI) and Value Iteration (VI) algorithms.

For this question, you are only required to provide implementations of the necessary functions. For each algorithm, you can find the functions that you need to implement under Tasks below. Make sure to carefully read the code documentation to understand the input and required outputs of these functions. We will mark your submission only based on the correctness of the outputs of these functions.


Policy Iteration (PI): You can find more details, including pseudocode, in the RL textbook on page 80. Also see Lecture 4 on dynamic programming (pseudocode on slide 17).
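Policy Iteration, which the textbook presents on page 80, alternates between evaluating the current policy and greedily improving it until the policy stops changing. The following is only a minimal sketch of that loop in plain Python. The transition-table format, the function names and the toy two-state MDP are assumptions made for illustration; they are not the interface of the provided code base.

```python
from typing import Dict, List, Tuple

# Hypothetical MDP encoding, for illustration only:
# transitions[(state, action)] is a list of (next_state, probability, reward).
Transitions = Dict[Tuple[int, int], List[Tuple[int, float, float]]]


def q_value(transitions: Transitions, values: List[float],
            s: int, a: int, gamma: float) -> float:
    """Expected one-step return of taking action a in state s."""
    return sum(p * (r + gamma * values[s2]) for s2, p, r in transitions[(s, a)])


def policy_evaluation(transitions, policy, n_states, gamma, theta=1e-8):
    """Iteratively compute the value function of a fixed deterministic policy."""
    values = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            new_value = q_value(transitions, values, s, policy[s], gamma)
            delta = max(delta, abs(new_value - values[s]))
            values[s] = new_value
        if delta < theta:
            return values


def policy_iteration(transitions, n_states, n_actions, gamma, theta=1e-8):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = [0] * n_states
    while True:
        values = policy_evaluation(transitions, policy, n_states, gamma, theta)
        stable = True
        for s in range(n_states):
            current = q_value(transitions, values, s, policy[s], gamma)
            for a in range(n_actions):
                candidate = q_value(transitions, values, s, a, gamma)
                # Switch only on a clear improvement so that ties cannot cause cycling.
                if candidate > current + 1e-6:
                    policy[s], current, stable = a, candidate, False
        if stable:
            return policy, values


# Toy deterministic MDP with made-up rewards (two states, two actions).
TOY_MDP: Transitions = {
    (0, 0): [(0, 1.0, 0.0)],  # stay in state 0, no reward
    (0, 1): [(1, 1.0, 1.0)],  # move to state 1, reward 1
    (1, 0): [(1, 1.0, 2.0)],  # stay in state 1, reward 2
    (1, 1): [(0, 1.0, 0.0)],  # move back to state 0, no reward
}

if __name__ == "__main__":
    policy, values = policy_iteration(TOY_MDP, n_states=2, n_actions=2, gamma=0.8)
    print(policy, [round(v, 3) for v in values])
```

In this toy example the improvement step moves state 0 towards state 1, where the larger recurring reward is collected.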

Value Iteration (VI): You can find more details, including pseudocode, in the RL textbook on page 83. Also see Lecture 4 on dynamic programming (pseudocode on slide 22).
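Value Iteration, which the textbook presents on page 83, repeatedly applies the Bellman optimality backup to every state and only extracts a greedy policy once the values have converged. The sketch below illustrates this under assumptions of our own: the transition-table format, the function names and the toy MDP are hypothetical and do not mirror the classes in the provided code base.

```python
from typing import Dict, List, Tuple

# Hypothetical MDP encoding, for illustration only:
# transitions[(state, action)] is a list of (next_state, probability, reward).
Transitions = Dict[Tuple[int, int], List[Tuple[int, float, float]]]


def backup(transitions: Transitions, values: List[float],
           s: int, a: int, gamma: float) -> float:
    """Expected one-step return of taking action a in state s."""
    return sum(p * (r + gamma * values[s2]) for s2, p, r in transitions[(s, a)])


def value_iteration(transitions, n_states, n_actions, gamma, theta=1e-8):
    """Return a greedy policy and the converged state values."""
    values = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: take the best action value.
            best = max(backup(transitions, values, s, a, gamma)
                       for a in range(n_actions))
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < theta:
            break
    # Extract a deterministic greedy policy from the converged values.
    policy = [
        max(range(n_actions), key=lambda a: backup(transitions, values, s, a, gamma))
        for s in range(n_states)
    ]
    return policy, values


# Toy deterministic MDP with made-up rewards (two states, two actions).
TOY_MDP: Transitions = {
    (0, 0): [(0, 1.0, 0.0)],  # stay in state 0, no reward
    (0, 1): [(1, 1.0, 1.0)],  # move to state 1, reward 1
    (1, 0): [(1, 1.0, 2.0)],  # stay in state 1, reward 2
    (1, 1): [(0, 1.0, 0.0)],  # move back to state 0, no reward
}

if __name__ == "__main__":
    policy, values = value_iteration(TOY_MDP, n_states=2, n_actions=2, gamma=0.8)
    print(policy, [round(v, 3) for v in values])
```

Unlike Policy Iteration, no explicit policy is maintained during the sweeps; the policy is read off the value function at the end.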


In this exercise, we train dynamic programming algorithms on MDPs. We provide you with functionality which enables you to define your own MDPs for testing. For an example of how to use these functions, see the main function at the end of exercise1/mdp where the “Frog on a Rock” MDP from the tutorials shown in Figure 1 is defined and given as input to the training function with γ = 0.8.

As a side note, our interface for defining custom MDPs requires all actions to be valid over all states in the state space. Therefore, remember to include a probability distribution over next states for every possible state-action pair to avoid any errors from the interface.
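The requirement that every state-action pair has a distribution over next states can be checked before training. The helper below is a hypothetical sketch and is not part of the provided interface: the dictionary format, the name check_mdp and the toy states and probabilities are assumptions made for illustration.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

# Hypothetical format: transitions[(state, action)] maps next_state -> probability.
TransitionTable = Dict[Tuple[Hashable, Hashable], Dict[Hashable, float]]


def check_mdp(states: Sequence[Hashable], actions: Sequence[Hashable],
              transitions: TransitionTable, tol: float = 1e-9) -> List[str]:
    """Return a list of human-readable problems; an empty list means the table is complete."""
    problems = []
    for s in states:
        for a in actions:
            dist = transitions.get((s, a))
            if dist is None:
                problems.append(f"missing distribution for state={s!r}, action={a!r}")
            elif abs(sum(dist.values()) - 1.0) > tol:
                problems.append(f"probabilities for state={s!r}, action={a!r} do not sum to 1")
    return problems


# Made-up two-state example; the probabilities are illustrative only.
STATES = ["rock", "land"]
ACTIONS = ["jump", "stay"]
COMPLETE: TransitionTable = {
    ("rock", "jump"): {"rock": 0.2, "land": 0.8},
    ("rock", "stay"): {"rock": 1.0},
    ("land", "jump"): {"land": 1.0},  # action has no effect, but is still defined
    ("land", "stay"): {"land": 1.0},
}

if __name__ == "__main__":
    print(check_mdp(STATES, ACTIONS, COMPLETE))
```

Actions that are meaningless in a state can be modelled as a self-transition with probability 1, as in the "land" entries above.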


Use the code base provided in the directory exercise1 and implement the following functions.

Value Iteration [7.5 Marks]

To implement the Value Iteration algorithm, you must implement the following functions in the ValueIteration class:

Policy Iteration [7.5 Marks]

To implement the Policy Iteration algorithm, you must implement the following functions in the PolicyIteration class:

Aside from the aforementioned functions, the rest of the code base for this question must be left unchanged. A good starting point for this question is to read the code base and the documentation to get a better grasp of how the entire training process works.

Directly run the file mdp to print the calculated policies for VI and PI for a test MDP. Feel free to tweak or change the MDP and make sure it works consistently.

This question does not require a lot of effort to complete and you can provide a correct implementation with fewer than 50 lines of code. Additionally, training should require less than a minute of running time.

Question 2 – Tabular Reinforcement Learning [20 Marks]


The aim of the second question is to provide you with practical experience in implementing model-free reinforcement learning algorithms with tabular Q-functions. Specifically, you are required to implement the Q-Learning and on-policy first-visit Monte Carlo algorithms.

For all algorithms, you are required to provide implementations of the necessary functions. You can find the functions that you need to implement below. Make sure to carefully read the documentation of these functions to understand their input and required outputs. We will mark your submission based on the correctness of the outputs of the required functions, the performance of your learning agents measured by the average returns on the Taxi-v3 environment, and the answers you’ve provided in answer


You can find more details, including pseudocode for QL, in the RL textbook on page 131. Also see Lecture 6 on Temporal Difference learning (slide 19).
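The core of tabular Q-Learning is a single update that moves Q(s, a) towards the reward plus the discounted maximum Q-value of the next state. The sketch below shows that update together with ϵ-greedy action selection. The function names, the dictionary-based Q-table and the parameter values are assumptions made for illustration and are not the interface of the provided agent classes.

```python
import random
from collections import defaultdict
from typing import DefaultDict, Tuple

# Hypothetical Q-table: maps (state, action) to a value, defaulting to 0.
QTable = DefaultDict[Tuple[int, int], float]


def epsilon_greedy(q: QTable, state: int, n_actions: int,
                   epsilon: float, rng: random.Random) -> int:
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    action_values = [q[(state, a)] for a in range(n_actions)]
    best = max(action_values)
    # Break ties between equally good actions at random.
    return rng.choice([a for a, v in enumerate(action_values) if v == best])


def q_learning_update(q: QTable, state: int, action: int, reward: float,
                      next_state: int, done: bool, n_actions: int,
                      alpha: float, gamma: float) -> float:
    """Apply one Q-Learning update in place and return the new Q(state, action)."""
    target = reward
    if not done:
        # Off-policy target: bootstrap from the best next action,
        # regardless of which action the behaviour policy takes next.
        target += gamma * max(q[(next_state, a)] for a in range(n_actions))
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


if __name__ == "__main__":
    q: QTable = defaultdict(float)
    q[(1, 0)] = 2.0
    rng = random.Random(0)
    action = epsilon_greedy(q, state=1, n_actions=2, epsilon=0.1, rng=rng)
    print(action, q_learning_update(q, 0, 1, 1.0, 1, False, 2, alpha=0.5, gamma=0.9))
```

On terminal transitions the bootstrap term is dropped, so the target is just the final reward.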

You can find more details including pseudocode for on-policy first-visit MC with ϵ-soft policies in the RL textbook on page 101. Also see Lecture 5 on MC methods (slide 17).
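First-visit Monte Carlo waits until an episode has finished, computes the discounted return following each time step, and averages that return into Q(s, a) only for the first occurrence of each state-action pair in the episode. The following sketch illustrates the update for one episode. The episode format, the function name and the incremental-average bookkeeping are assumptions made for illustration, not the provided code base's interface.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

# Hypothetical episode format: (state, action, reward) tuples in time order.
Episode = List[Tuple[int, int, float]]


def first_visit_mc_update(q: DefaultDict[Tuple[int, int], float],
                          visit_counts: DefaultDict[Tuple[int, int], int],
                          episode: Episode, gamma: float) -> None:
    """Update q in place with the first-visit returns of one finished episode."""
    # Record the first time step at which each (state, action) pair occurs.
    first_visit: Dict[Tuple[int, int], int] = {}
    for t, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), t)

    # Walk backwards so the return can be accumulated incrementally.
    g = 0.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        g = r + gamma * g
        if first_visit[(s, a)] == t:
            visit_counts[(s, a)] += 1
            # Incremental mean of all first-visit returns seen so far.
            q[(s, a)] += (g - q[(s, a)]) / visit_counts[(s, a)]


if __name__ == "__main__":
    q: DefaultDict[Tuple[int, int], float] = defaultdict(float)
    counts: DefaultDict[Tuple[int, int], int] = defaultdict(int)
    # Made-up episode in which the pair (0, 0) is visited twice.
    first_visit_mc_update(q, counts, [(0, 0, 1.0), (1, 0, 0.0), (0, 0, 2.0)], gamma=0.5)
    print(dict(q))
```

The second occurrence of the pair (0, 0) contributes its reward to earlier returns but does not trigger its own update, which is what distinguishes first-visit from every-visit Monte Carlo.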


In this question, we train agents on the OpenAI Gym Taxi-v3 environment. This environment is a simple task where the goal of the agent is to navigate a taxi (yellow box – empty taxi; green box – taxi with passenger) to a passenger (blue location), pick them up and drop them off at the destination (purple location) in a grid-world.