Course manual 2021/2022

Course content

Data is at the center of modern businesses and institutions, and recent years have seen a major shift towards predictive, data-driven decision making. This development has given rise to Big Data, a novel set of data storage and processing methods, accompanied by a new software stack that serves as the foundation for the ongoing AI revolution. Pioneered by companies building web-scale search engines, such as Google, Big Data technologies are now used across all industries and are a major component of modern cloud infrastructure.

In this course, we study various abstractions, processing techniques and Big Data systems for working with large collections of data, including relational database management systems, MapReduce and Apache Spark. We address the foundations of Big Data applications, as well as practical topics such as choosing a suitable data representation and implementing distributed processing operations. We focus on the foundations of distributed data processing and programming models for parallelisable programs, as well as on data quality and responsible data management.

You will learn theoretical concepts for data processing during the lectures. In the lab, you will gain hands-on experience through a number of coding assignments and by participating in a Kaggle-like Big Data competition. In addition, we offer online tutorials for hands-on experience with established industry software packages. Finally, experts from the field (both academic and industry colleagues) are invited to give guest lectures.

Study materials

Literature

  • Book Chapters

  • Scientific Papers

  • Presentation Slides

Syllabus

  • A syllabus containing articles and chapters will be made available at the beginning of the course. 

Practical training material

  • Programming assignments

  • Kaggle-like data project

  • Video Tutorials

Software

  • Python

  • Jupyter

  • DuckDB

  • PySpark

Other

  • Videos

Objectives

  • Explain the high impact and potential of Big Data technology
  • Create a scalable Big Data application for a given scenario
  • Analyse data schemas and distributed data processing strategies
  • Design a scalable Big Data application for a machine learning task and present findings in a poster session
  • Explain what relates and differentiates relational data processing, MapReduce and Resilient Distributed Datasets
  • Describe common data quality issues and apply error detection and data cleaning methods
  • Program key Big Data systems

Teaching methods

  • Lecture
  • Self-study
  • Computer lab session/practical training
  • Presentation/symposium
  • Working independently on e.g. a project or thesis

Learning activities

Activity          Hours
Lecture           14
Laptop session    14
Presentation      6
Tutorial          14
Self-study        120
Total             168 (6 EC x 28 hours)

Attendance

Part B of the TER for this programme does not mention any attendance requirements.

Additional requirements for this course:

Participation will be measured. Attendance in the lab sessions is highly recommended in order to attain the programming skills and background required for the assignments and the project.

Assessment

Final grade

Weight    Item
55%       Exam
40%       Project presentation
5%        (Lab) assignments

To pass the course, all parts must be passed. The exam will be an open-book exam (based on the materials provided on Canvas). In case of a resit, the resit grade will count.
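As an illustration of the weighting above, the following minimal Python sketch computes a weighted final grade and checks the "all parts passed" rule. The weights come from the assessment table; the pass mark of 5.5, the function names and the example grades are assumptions made for illustration only, since this manual does not specify a pass mark or a rounding rule.

```python
# Weights in percent, taken from the assessment table.
WEIGHTS = {"exam": 55, "project_presentation": 40, "lab_assignments": 5}

# Assumption: the manual does not state a pass mark; 5.5 is a placeholder.
PASS_MARK = 5.5


def final_grade(grades):
    """Return the weighted average of the part grades (no rounding applied)."""
    return sum(WEIGHTS[part] * grades[part] for part in WEIGHTS) / 100


def passed(grades, pass_mark=PASS_MARK):
    """Return True only if every part reaches the (assumed) pass mark."""
    return all(grades[part] >= pass_mark for part in WEIGHTS)


# Hypothetical example grades on a 1-10 scale.
example = {"exam": 7.0, "project_presentation": 8.0, "lab_assignments": 9.0}
print(final_grade(example))
print(passed(example))
```

Note that under the "all parts" rule, a high weighted average does not compensate for a failed part.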

Assignments

Programming assignments 1 & 2

  • Individual assignments, auto-graded with immediate feedback.

Open questions

  • Individual assignment, graded on Canvas.

Project poster presentation

  • Group assignment, graded during a poster session.

Fraud and plagiarism

The 'Regulations governing fraud and plagiarism for UvA students' applies to this course. This will be monitored carefully. Upon suspicion of fraud or plagiarism the Examinations Board of the programme will be informed. For the 'Regulations governing fraud and plagiarism for UvA students' see: www.student.uva.nl

Course structure

Week  Lecture                         Lab                                  Online Tech Tutorials
1     Intro & Foundations             Intro & Setup                        -
2     Relational Data Processing      Lab Assignment 1: SQL                Version Control with Git
3     MapReduce                       Lab Assignment 2: MapReduce / Spark  DuckDB
4     Resilient Distributed Datasets  Project week 1                       Apache Spark Deep Dive
5     Data Cleaning                   Project week 2                       Great Expectations
6     Responsible Data Management     Project week 3                       AIF 360
7     Big Data at bol.com             Project week 4                       Kubernetes
8     Exam                            Poster Session                       -

Timetable

The schedule for this course is published on DataNose.

Additional information

The course will be taught in English. 

Basic programming skills and knowledge of computing systems and machine learning are required.

Contact information

Coordinator

  • dr. ing. S. Schelter