Course manual 2021/2022

Course content

Many issues that arise in data science and real-world machine learning are neglected in standard machine learning courses. In this course we will focus on two of them:

(i) many tasks are inherently trying to answer causal questions and gather actionable insights, even when there is not enough data to draw causal conclusions;

(ii) data is often missing not at random, heterogeneous or not i.i.d.

For the first issue, we will focus on formulating the correct causal questions and the assumptions needed to solve the real-world task at hand. For example, a strong correlation between two variables X and Y is not enough to justify a policy in which we change X and expect to see an increase in Y (i.e. “correlation is not causation”). On the other hand, if we measure another variable Z that we know causes X, but does not have a direct effect on Y (i.e. an instrumental variable), we can discover under certain assumptions that X is a cause of Y, even if we have not performed any experiment. In the course we will learn about causal discovery, which extends this case to multiple variables and multiple observational and experimental datasets, and about causal effect estimation, which quantifies the causal effect a variable X has on another variable Y.
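The instrumental variable idea can be illustrated with a small simulation. The following is a minimal sketch, not part of the course material: it assumes a linear model with an unobserved confounder U, and all function names, coefficients and sample sizes are hypothetical choices made for illustration. It compares a naive regression of Y on X with the simple ratio (Wald) estimator that uses Z as an instrument.

```python
import numpy as np


def simulate(n=20000, true_effect=2.0, seed=0):
    """Simulate linear data with a hidden confounder U (illustrative values only)."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)  # unobserved common cause of X and Y
    z = rng.normal(size=n)  # instrument: causes X, no direct effect on Y
    x = 1.0 * z + 1.0 * u + rng.normal(size=n)
    y = true_effect * x + 3.0 * u + rng.normal(size=n)
    return z, x, y


def ols_slope(x, y):
    """Naive regression slope of Y on X: cov(X, Y) / var(X)."""
    c = np.cov(x, y)
    return c[0, 1] / c[0, 0]


def iv_slope(z, x, y):
    """Ratio (Wald) instrumental variable estimate: cov(Z, Y) / cov(Z, X)."""
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]


z, x, y = simulate()
naive = ols_slope(x, y)   # biased upwards by the hidden confounder
iv = iv_slope(z, x, y)    # close to the simulated effect if the IV assumptions hold
print(f"naive slope: {naive:.2f}, IV estimate: {iv:.2f}, simulated effect: 2.00")
```

In this sketch the naive slope overestimates the effect because U influences both X and Y, while the ratio estimator stays close to the simulated value. This relies on the assumptions built into the simulation: Z affects X, Z is independent of U, Z influences Y only through X, and all relations are linear.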

In particular, we will discuss how to interpret the output of existing methods and their assumptions, as well as the concept of identifiability, i.e. when one can answer the relevant causal questions with the data at hand, and which new data or experiments may be required when one cannot.

To address the second issue, we will look into data fusion methods based on causal graphs and show that they can correctly represent different distributions without leading to wrong conclusions. In particular, we will show how these methods can be applied to transfer learning and domain adaptation tasks.

While the lectures will provide the theoretical foundations, the course project will allow small teams of students to apply these concepts in a simplified real-world setting, with additional practical guidance in terms of existing tools during the lab assignments.

Study materials

Syllabus

- A syllabus containing articles and chapters will be made available at the beginning of the course.
  
  The syllabus/course will cover parts of the following books:
  
  - Spirtes, Glymour, Scheines (2000): Causation, prediction and search. MIT Press.
  
  - Jonas Peters, Dominik Janzing, Bernhard Schölkopf (2017): Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.
  
  - Hernán, Robins (2020): Causal inference. Chapman & Hall.

Objectives

Understand the high impact and potential of causal inference, as well as its limitations
Identify the correct causal questions and assumptions in a given data science task
Analyze and interpret the outputs of existing causal inference tools
Understand which questions can and which cannot be answered with the current data (identifiability) and which experiments/further data sources could help
Combine different tools in order to solve more complex causal questions
[Optionally] implement simple extensions to existing algorithms in causal discovery, causal effect estimation and applications of causality to machine learning

Teaching methods

Lecture
Computer lab session/practical training
Self-study
Presentation/symposium
Working independently on e.g. a project or thesis

Learning activities

Activity	Hours
Lectures	24
Practicals	14
Presentation	4
Self study	126
Total	168	(6 EC x 28 hours)

Attendance

In TER part B of this programme no requirements regarding attendance are mentioned.

Assessment

Item and weight	Details
Final grade

The assessment of the course consists of three parts:

Quizzes on theory and in-class participation (10%)
Course project presentation (20%) - group grade
Course project report (70%) - individual grade

The final grade will be a weighted average of the grades in each part. The passing grade is a final grade >= 5.5.
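As an illustration of the weighting above, here is a minimal sketch of the calculation. The part grades in the example are hypothetical, the function names are illustrative, and no rounding is applied because the manual does not specify a rounding rule.

```python
# Weights of the three assessment parts, as listed above.
WEIGHTS = {"quizzes": 0.10, "presentation": 0.20, "report": 0.70}


def final_grade(grades):
    """Weighted average of the part grades (no rounding applied)."""
    return sum(WEIGHTS[part] * grades[part] for part in WEIGHTS)


def passes(grade, threshold=5.5):
    """A final grade of at least 5.5 is a pass."""
    return grade >= threshold


# Hypothetical part grades, for illustration only.
example = {"quizzes": 8.0, "presentation": 7.0, "report": 6.0}
grade = final_grade(example)
print(f"final grade: {grade:.2f}, pass: {passes(grade)}")
```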

Fraud and plagiarism

The 'Regulations governing fraud and plagiarism for UvA students' applies to this course. This will be monitored carefully. Upon suspicion of fraud or plagiarism the Examinations Board of the programme will be informed. For the 'Regulations governing fraud and plagiarism for UvA students' see: www.student.uva.nl

Course structure

Week number	Topic
1	Introduction and Probability Recap
2	Causal graphs and Interventions
3	Covariate adjustment
4	Potential outcomes
5	Causal Discovery
6	Advanced topics (causality-inspired ML)
7	Presentations of projects
8	No class - exam week
1st April 2022	Paper deadline

Owner	Master Information Studies
Coordinator	dr. Sara Magliacane
Part of	Master Information Studies, track Data Science, year 1; Master Information Studies, track Information Systems, year 1

Causal Data Science