Course manual 2021/2022

Course content

Data is at the center of modern businesses and institutions, and recent years have seen a major shift towards predictive, data-driven decision making. This development has given rise to Big Data, a novel set of data storage and processing methods, accompanied by a new software stack that serves as the foundation for the ongoing AI revolution. Pioneered by companies building web-scale search engines such as Google, Big Data technologies are now used across all industries and are a major component of modern cloud infrastructure.

In this course, we study various abstractions, processing techniques and Big Data systems for working with large collections of data, including relational database management systems, MapReduce and Apache Spark. We address the foundations of Big Data applications, as well as topics such as choosing a suitable data representation and implementing distributed processing operations. We focus on the foundations of distributed data processing and on programming models for parallelisable programs, as well as on data quality and responsible data management.
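To make the MapReduce programming model mentioned above more concrete, here is a minimal single-machine sketch in plain Python. It is not course material and does not use any Big Data system: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the word-count task are illustrative assumptions, chosen only to show the map, shuffle and reduce steps that a distributed system would run across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence (hypothetical helper)."""
    # Each document is mapped independently, so this step is parallelisable.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    # Each key is reduced independently, so this step is parallelisable too.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data systems", "big data processing"]
    print(word_count(docs))
```

Because every document is mapped on its own and every key is reduced on its own, both phases could be spread over many workers; only the shuffle step requires moving data between them.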

You will learn theoretical concepts for data processing during the lectures. In the lab, you will gain hands-on experience through a number of coding assignments and by participating in a Kaggle-like Big Data competition. In addition, we offer online tutorials for hands-on experience with established industry software packages. Finally, experts from the field (both academic and industry colleagues) are invited to give guest lectures.

Study materials

Literature

Book Chapters
Scientific Papers
Presentation Slides

Syllabus

A syllabus containing articles and chapters will be made available at the beginning of the course.

Practical training material

Programming assignments
Kaggle-like data project
Video Tutorials

Software

Python
Jupyter
DuckDB
PySpark

Other

Videos

Objectives

Explain the high impact and potential of Big Data technology
Create a scalable Big Data application for a given scenario
Analyse data schemas and distributed data processing strategies
Design a scalable Big Data application for a machine learning task and present findings in a poster session
Explain what relates and differentiates relational data processing, MapReduce and Resilient Distributed Datasets
Describe common data quality issues and apply error detection and data cleaning methods
Program key Big Data systems

Teaching methods

Lecture
Self-study
Computer lab session/practical training
Presentation/symposium
Working independently on e.g. a project or thesis

Learning activities

Activity	Hours
Lecture	14
Laptop seminar	14
Presentation	6
Seminar	14
Self study	120
Total	168	(6 EC x 28 hours)

Attendance

In TER part B of this programme no requirements regarding attendance are mentioned.

Additional requirements for this course:

Participation will be measured. Attendance in the lab sessions is highly recommended in order to attain the programming skills and background required for the assignments and the project.

Assessment

Item and weight	Details
Final grade
55% Exam
40% Project Presentation
5% (Lab) Assignments

To pass the course, all parts must be passed. The exam will be an open-book exam (based on the materials provided on Canvas). In case of a resit, the resit grade will count.
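The weighting above can be sketched as a small calculation. This is only an illustration, not the official grading procedure: the function name `final_grade`, the part names, and the `pass_mark` parameter are assumptions, since this manual states the weights and the rule that all parts must be passed, but not the pass threshold or any rounding rules.

```python
# Weights taken from the assessment table above.
WEIGHTS = {"exam": 0.55, "project": 0.40, "assignments": 0.05}


def final_grade(grades, pass_mark):
    """Return (weighted grade, passed) for a dict of part grades.

    pass_mark is a hypothetical parameter: the manual does not state
    the threshold, so the caller has to supply it.
    """
    missing = set(WEIGHTS) - set(grades)
    if missing:
        raise ValueError(f"missing grades for: {sorted(missing)}")
    weighted = sum(WEIGHTS[part] * grades[part] for part in WEIGHTS)
    # All parts must be passed individually, not just the weighted total.
    passed = all(grades[part] >= pass_mark for part in WEIGHTS)
    # Rounding to two decimals is for display only (an assumption).
    return round(weighted, 2), passed


if __name__ == "__main__":
    # Example grades and a pass mark of 5.5 are made up for illustration.
    example = {"exam": 7.0, "project": 8.0, "assignments": 6.0}
    print(final_grade(example, pass_mark=5.5))
```

In case of a resit, the resit grade would replace the exam entry before the calculation is run.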

Assignments

Programming assignments 1 & 2

Individual assignments, auto-graded with immediate feedback.

Open questions

Individual assignment, graded on Canvas.

Project poster presentation

Group assignment, graded via a poster session.

Fraud and plagiarism

The 'Regulations governing fraud and plagiarism for UvA students' apply to this course. Compliance will be monitored carefully. Upon suspicion of fraud or plagiarism, the Examinations Board of the programme will be informed. For the 'Regulations governing fraud and plagiarism for UvA students', see: www.student.uva.nl

Course structure

Week	Lecture	Lab	Online Tech Tutorials
1	Intro & Foundations	Intro & Setup	-
2	Relational Data Processing	Lab Assignment 1: SQL	Version Control with Git
3	MapReduce	Lab Assignment 2: MapReduce / Spark	DuckDB
4	Resilient Distributed Datasets	Project week 1	Apache Spark Deep Dive
5	Data Cleaning	Project week 2	Great Expectations
6	Responsible Data Management	Project week 3	AIF 360
7	Big Data at bol.com	Project week 4	Kubernetes
8	Exam	Poster Session	-

Timetable

The schedule for this course is published on DataNose.

Additional information

The course will be taught in English.

Basic programming skills, knowledge of computing systems and machine learning are required.

Contact information

Coordinator

dr. ing. S. Schelter

Owner	Master Information Studies
Coordinator	dr. ing. S. Schelter
Part of	Master Information Studies, track Human Centered Multimedia, year 1; Master Information Studies, track Data Science, year 1; Master Information Studies, track Information Systems, year 1; Master Forensic Science, year 2