Course manual 2025/2026

Course content

Reinforcement learning is a general framework for studying sequential decision-making problems. In such problems, an action must be chosen at every time step so as to optimize long-term performance. This is a very wide class of problems that includes robotic control and game playing, but also human and animal behavior. Reinforcement learning methods can be applied when no training labels for the optimal action are available, so that good actions have to be discovered through trial and error.

In this course, we will discuss properties of reinforcement learning problems and algorithms to solve them. In particular, we will look at:

Solving problems with discrete action sets using so-called value-based methods. We will cover approximate methods that can be employed in problems with large state spaces.
Solving problems with continuous actions using approximate policy-based methods. These methods have applications in, among other areas, robotics and control.
The use of multi-layer neural networks as function approximators in reinforcement learning algorithms, yielding so-called deep reinforcement learning methods.
We will also briefly cover advanced and frontier topics in reinforcement learning.

Study materials

Literature

Reinforcement Learning: An Introduction. R. S. Sutton & A. G. Barto
Second Edition. Available: http://incompleteideas.net/book/RLbook2020.pdf. We will cover or partially cover chapters 1-6, 8-11, 13, 16, and 17.
A Survey on Policy Search for Robotics. M. P. Deisenroth, G. Neumann, J. Peters
Available: https://www.deisenroth.cc/pdf/fnt_corrected_2014-08-26.pdf. We will partially cover this survey.
We will study new developments in the field of RL through recent publicly available papers. Links will be distributed through the class website.

Objectives

The student is able to describe the main Monte Carlo, temporal-difference, model-based and policy-based algorithms
The student is able to describe the main differences between on-policy and off-policy learning, TD and Monte Carlo methods, value-based and policy-based methods, and tabular and approximate methods, and to categorise reinforcement learning methods according to these properties
The student is able to implement reinforcement learning algorithms
The student is able to compare reinforcement learning algorithms and point out their main advantages and disadvantages
The student is able to critically evaluate the performance of reinforcement learning algorithms (or lack thereof) in a given environment
The student is able to critically evaluate reinforcement learning experiments and point out their strong points and weak points

Teaching methods

Lecture
Seminar

In the lectures, the theoretical background will be covered, together with building up an intuitive understanding, through examples and explanation, of how the methods relate to each other.

The practical sessions focus on applying and practicing the techniques from the lecture.

Learning activities

Activity	Hours
Lecture	28
Computer lab session	14
Exam	3
Tutorial (werkcollege)	14
Self-study	109
Total	168	(6 EC x 28 hours)

Attendance

This programme does not have requirements concerning attendance (OER part B).

Additional requirements for this course:

Attendance at lectures and tutorial sessions is strongly encouraged but not required.

Assessment

Item and weight	Details
Final grade
0.65 (65%) Exam
0.35 (35%) Practical assignments
4 (11%) Homework 1
4 (11%) Homework 2
4 (11%) Homework 3
4 (11%) Homework 4
4 (11%) Homework 5
5 (14%) Homework 6 / empirical RL
2 (6%) Lab 1 - Dynamic Programming
2 (6%) Lab 2 - Monte Carlo
2 (6%) Lab 3 - Temporal Difference
2 (6%) Lab 4 - Deep Q-Network
2 (6%) Lab 5 - Policy Gradient

A resit is possible for the exam only.

A result of at least 5.0 on the exam is necessary to pass the course. In addition, the weighted average of the assignments and the exam must be at least 5.5 to pass the course.

The final grade is a weighted average of assignments and the exam as follows:

Exam: 65%
5 homework assignments (Homework 1-5): 4% each
5 coding assignments (Labs 1-5): 2% each
Homework 6 ("Empirical RL report"): 5%

Assignments handed in after the deadline without permission may not be graded or may not be awarded full points. Please see the course policy in the syllabus on Canvas.

The grade for the practical assignments (homework and coding assignments) cannot be retaken. If the resit exam is taken, the final grade is calculated from the new exam grade and the original grade for the practical assignments, using the original weights.
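To make the grading rules above concrete, the following is a minimal sketch of how the weights and the two pass conditions combine. It is an illustration only, not an official grade calculator: the function name `final_grade` is hypothetical, grades are assumed to be on the 1-10 scale, and any rounding applied by the programme is not modelled. The resit case uses the same function, with the new exam grade and the original practical grades.

```python
# Hypothetical sketch of the grading rules in this manual.
# Assumptions (not stated in the manual): grades are on the 1-10 scale
# and no rounding is applied before the pass checks.

EXAM_WEIGHT = 0.65
HOMEWORK_WEIGHT = 0.04  # each of Homework 1-5
ERL_WEIGHT = 0.05       # Homework 6 / Empirical RL report
LAB_WEIGHT = 0.02       # each of Labs 1-5


def final_grade(exam, homeworks, erl_report, labs):
    """Return (weighted average, passed) for one student.

    For a resit, pass the new exam grade with the original practical grades.
    """
    if len(homeworks) != 5 or len(labs) != 5:
        raise ValueError("expected 5 homework grades and 5 lab grades")
    practical = (HOMEWORK_WEIGHT * sum(homeworks)
                 + ERL_WEIGHT * erl_report
                 + LAB_WEIGHT * sum(labs))
    grade = EXAM_WEIGHT * exam + practical
    # Both conditions must hold: exam >= 5.0 and weighted average >= 5.5.
    passed = exam >= 5.0 and grade >= 5.5
    return grade, passed


# Usage example.
# Exam 6.0, all practical work 8.0: 0.65*6.0 + 0.35*8.0 = 6.7, so the student passes.
grade, passed = final_grade(6.0, [8.0] * 5, 8.0, [8.0] * 5)
print(round(grade, 2), passed)
# Exam 4.8 with perfect practical work: the average is 6.62, but the exam is below 5.0.
grade, passed = final_grade(4.8, [10.0] * 5, 10.0, [10.0] * 5)
print(round(grade, 2), passed)
```

The weights sum to 0.65 + 5 x 0.04 + 0.05 + 5 x 0.02 = 1.0, matching the 65%/35% split in the assessment table. The second example shows why the exam threshold is checked separately: a strong practical grade cannot compensate for an exam result below 5.0.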

Inspection of assessed work

Exam results and assignment results can be inspected online (via Ans.app and CodeGrade). For questions about the assessment of the exam, an announcement about the options will be made on Canvas. For questions about the assessment of practical assignments, please ask your TA.

Assignments

Both graded and ungraded assignments are provided. Only the graded assignments should be handed in.

The answers to ungraded assignments will be provided one week after the assignment was scheduled so that students can check their own work.

Practical assignments (homework assignments, coding assignments and the empirical RL report) can be done in groups of 2 or individually. Feedback will be given on Canvas, and additional feedback can be given on request by the TA during tutorial sessions.

Fraud and plagiarism

The 'Regulations governing fraud and plagiarism for UvA students' apply to this course. Compliance will be monitored carefully. Upon suspicion of fraud or plagiarism, the Examinations Board of the programme will be informed. For the 'Regulations governing fraud and plagiarism for UvA students', see: www.student.uva.nl

Course structure

T = tutorial (werkcollege), L = lecture

RL:AI = Reinforcement Learning: An Introduction, Ex = ungraded exercise, HW = homework

Note: under "Modules" on Canvas a slightly more detailed syllabus is available which includes homework and exam dates.

Lecture	Topic	Literature
T 1	Set-up programming environment, prior knowledge self-test	Ex. 0.1 & 0.2
L 1	Introduction. MDP & Bandit	RL:AI 1.1-1.4, 1.6, 2.1-2.4, 2.6, 2.7, 3.1-3.3
T 2		Ex. 1.1-1.3
L 2	Dynamic programming	RL:AI 2.5, 3.4-3.8, 4
T 3		Ex. 2.1-2.2, HW 2.3 & 2.4
L 3	Monte-Carlo methods	RL:AI 5.1-5.7
T 4		Ex. 3.1-3.3, HW 3.4
L 4	Temporal difference methods	RL:AI 6.1-6.5
T 5		Ex. 4.1-4.2, HW 4.3
L 5	From tabular learning to approximation	RL:AI 9.1 - 9.3
T 6		Ex. 5.1, 5.2, HW 5.3 & 5.4
L 6	On-policy temporal difference learning with approximation	RL:AI 9.3-9.8
T 7		Ex. 6.1-6.4, HW 6.5
L 7	Off-policy RL with approximation	RL:AI 10.1, 11.1-11.7
T 8		Ex. 7.1-7.3, HW 7.4
L 8	Deep RL (value-based methods)	RL:AI 16.5; DQN paper and CQL paper
T 9		Ex. 8.1, HW 8.2 & 8.3
L 9	Policy gradient methods: REINFORCE	RL:AI 13.1-13.4, 13.7; Survey 1 - 2.4.1.2
T 10		Ex. 9.1-9.3, HW 9.4
L 10	Policy gradient methods: PGT, DPG & evaluation	RL:AI 13.5, RL that matters paper, Empirical design paper, DPG paper
T 11		Ex. 10.1 - 10.2, HW 10.3 & 10.4
L 11	Advanced policy-based methods: Soft actor critic and return-conditioned policies	SAC paper, decision transformer paper, decision diffuser paper
T 12		Ex. 11.1 - 11.3
L 12	Planning and learning	RL:AI 8.1, 8.2, 8.8, 8.10, 8.11, 8.13, 16.6, AlphaGo paper
T 13		Ex. 12.1 & 12.2, work on ERL assignment
L 13	Partial observability	RL:AI 17.3
T 14	FAQ session (exam and reproducible research assignment)	Ex. 13.1, ERL assignment
	Hand in HW 6 (ERL assignment)!
L 14	Recap & Exam FAQ

Contact information

Coordinator

dr. H.C. van Hoof

For questions regarding assignment to tutorial groups, please see the Canvas announcement and contact Alejandro Munoz.

Questions about the content of lectures or exercises can be asked on the course Ed Discussion page (see link on Canvas).

For sensitive or private questions, please contact the course coordinator by e-mail or via a message on Canvas.

Owner	Master Artificial Intelligence
Coordinator	dr. H.C. van Hoof
Part of	Master Artificial Intelligence