6 EC
Semester 2, period 4
5294BIDA6Y
Data is at the center of modern businesses and institutions, and recent years have seen a major shift towards predictive, data-driven decision making. This development has given rise to Big Data, a novel set of data storage and processing methods, accompanied by a new software stack that serves as the foundation for the ongoing AI revolution. Pioneered by companies building web-scale search engines, such as Google, Big Data technologies are now used across all industries and are a major component of modern cloud infrastructure.
In this course, we study various abstractions, processing techniques and Big Data systems for working with large collections of data, including relational database management systems, MapReduce and Apache Spark. We address the foundations of Big Data applications, as well as topics like choosing a suitable data representation and implementing distributed processing operations. We focus on the foundations of distributed data processing and on programming models for parallelisable programs, as well as on data quality and responsible data management.
You will learn theoretical concepts for data processing during the lectures. In the lab, you will gain hands-on experience through a number of coding assignments and by participating in a Kaggle-like Big Data competition. In addition, we offer online tutorials for hands-on experience with established industry software packages. Finally, experts from the field (both academic and industry colleagues) are invited to give guest lectures.
Book Chapters
Scientific Papers
Presentation Slides
A syllabus containing articles and chapters will be made available at the beginning of the course.
Programming assignments
Kaggle-like data project
Video Tutorials
Python
Jupyter
DuckDB
PySpark
Videos
| Activity | Hours |
| Lecture | 14 |
| Laptop seminar | 14 |
| Presentation | 6 |
| Seminar | 14 |
| Self study | 120 |
| Total | 168 (6 EC x 28 hours) |
In TER part B of this programme no requirements regarding attendance are mentioned.
Additional requirements for this course:
Participation will be measured. Attendance in the lab sessions is highly recommended in order to attain the programming skills and background required for the assignments and the project.
| Item and weight | Details |
| Final grade | |
| 55% Exam | |
| 5% All assignments | |
| 40% Group projects / poster session | |
To pass the course, all parts must be passed. The exam will be an open book exam (based on the materials provided on Canvas). In case of a resit, the resit grade will count.
Individual assignments will be auto-graded with immediate feedback.
The individual assignment will be graded on Canvas.
The group assignment will be graded via a poster session.
The 'Regulations governing fraud and plagiarism for UvA students' applies to this course. This will be monitored carefully. Upon suspicion of fraud or plagiarism the Examinations Board of the programme will be informed. For the 'Regulations governing fraud and plagiarism for UvA students' see: www.student.uva.nl
| Week | Lecture | Lab | Online Tech Tutorials |
| 1 | Intro & Foundations | Intro & Setup | - |
| 2 | Relational Data Processing | Lab Assignment 1: SQL | Version Control with Git |
| 3 | MapReduce | Lab Assignment 2: MapReduce / Spark | DuckDB |
| 4 | Resilient Distributed Datasets | Project week 1 | Apache Spark Deep Dive |
| 5 | Data Cleaning | Project week 2 | Great Expectations |
| 6 | Big Data at Booking.com | Project week 3 | Kubernetes |
| 7 | Responsible Data Management | Project week 4 | AIF360 |
| 8 | Exam | Poster Session | - |
The course will be taught in English.
Basic programming skills, knowledge of computing systems and machine learning are required.