Course manual 2025/2026

Course content

This highly technical course focuses on the preparation and life-cycle management of data for production machine learning deployments. The course starts by recapping the fundamentals of relational data processing and dataflow systems. Subsequently, students learn about encoding, storing and managing vectorised feature representations of heterogeneous input data sources for machine learning applications, and about the architecture of current state-of-the-art systems for this task, such as Google’s TensorFlow Extended (TFX) platform. Concurrently, students will be exposed to foundational theory for this problem space, such as incremental view maintenance for relational data, fine-grained data provenance tracking via provenance semirings, and differential computation.

In addition, students will learn to identify, quantify and address common quality issues with respect to the completeness and consistency of the data. Furthermore, they will learn about the technical challenges of complying with regulations for private data, such as the “right to be forgotten” from the GDPR. Finally, students will be exposed to ongoing research efforts in this space, such as ML pipeline debugging or error detection techniques from data-centric AI, and they will have the opportunity to discuss the practical implications of the covered technologies with invited industry experts.

Study materials

Literature

  • Scientific papers

  • Book chapters

  • Presentation slides

Syllabus

  • Detailed information about the course and grading will be discussed in the first lecture

Practical training material

  • Programming assignments

  • Example code

Software

  • Java and Python-based open source software

Objectives

  • Describe the lifecycle of data in systems employing machine learning and predictive analytics.
  • Implement efficient data preparation programs using state-of-the-art relational and dataflow processing systems.
  • Identify and potentially correct data issues related to data quality, privacy violations or technical bias.
  • Design and validate a scalable data architecture for preparing and maintaining data for predictive analytics.

Teaching methods

  • Lecture
  • Working independently on e.g. a project or thesis
  • Presentation/symposium
  • Self-study
  • Computer lab session/practical training

Learning activities

Activity                          Hours
Lecture (hoorcollege)             12
Laptop session (laptopcollege)    8
Presentations                     6
Self-study                        120

Attendance

  • Some course components require compulsory attendance. If compulsory attendance applies, this will be indicated in the Course Catalogue, which can be consulted via the UvA website. The rationale for and implementation of this compulsory attendance may vary per course and, if applicable, are included in the Course Manual.
  • Additional requirements for this course:

    Participation will be measured. Attendance at the lab sessions is needed to acquire the programming skills and background required for the assignments and the project.

    Assessment

    Item and weight Details

    Final grade

    Details for the grading of the assignments and project will be made available during the course.

    Assignments

    • Three individual programming assignments
    • Project with presentation and paper to be conducted in groups of 3-4 students

    Fraud and plagiarism

    The 'Regulations governing fraud and plagiarism for UvA students' applies to this course. This will be monitored carefully. Upon suspicion of fraud or plagiarism the Examinations Board of the programme will be informed. For the 'Regulations governing fraud and plagiarism for UvA students' see: www.student.uva.nl

    Course structure

    Week number    Topics    Study material

    Contact information

    Coordinator

    • dr. H. Harmouch

    Staff

    • Antonios Georgakopoulos
    • D.I. Jackson MSc