Course manual 2020/2021

Course content

This course will provide students with a general understanding of data-related and systems-related challenges in Big Data applications. They will gain fundamental knowledge about principled approaches to tackle such challenges, with respect to systems abstractions, programming models and execution models for parallel and distributed data-intensive applications.

The course prepares students for data-related tasks in a job as Data Engineer, ML Engineer, Applied Scientist or Researcher, and focuses on implementation skills, for example by walking students through low-level MapReduce jobs for several data-related problems and common data processing operators. At the same time, the course highlights ongoing research problems in the area of Big Data processing. It also details the history of many systems currently at the forefront of computing; for example, it discusses the roots of Google's TensorFlow in previous Big Data systems at Google.
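To give a rough idea of what a low-level MapReduce job looks like, the following is a minimal in-memory sketch of a word count in Python. It is not a course assignment and it does not use the Apache Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. In a real framework, the shuffle step is performed by the system across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one key."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce sequentially on a list of text records."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data systems", "big data"])
print(counts)
```

The mapper and reducer only see one record or one key at a time, which is what allows a distributed system to run many of them in parallel.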

The course will also feature guest speakers from leading companies to connect students to real-world problems.

Study materials

Literature

Scientific papers.

Practical training material

Programming exercises

Software

Apache Hadoop, Apache Maven

Objectives

Develop a general understanding of data-related and systems-related challenges in Big Data applications.
Gain fundamental knowledge about principled approaches to tackle Big Data challenges, e.g., systems abstractions, programming models and execution models.
Practice distributed data processing by implementing low-level MapReduce jobs.
Read, understand, and reflect upon a recent research paper in data management.
Work in project teams with people from a variety of backgrounds towards solving an end-to-end Big Data problem.

Teaching methods

Lecture
Self-study
Computer lab session/practical training
Seminar
Presentation/symposium
Working independently on e.g. a project or thesis

Learning activities

Activity	Hours
Lecture	14
Computer lab session	14
Presentation	6
Seminar	14
Self-study	120
Total	168	(6 EC x 28 hours)

Attendance

Part B of the Teaching and Examination Regulations (TER) of this programme does not specify any attendance requirements.

Additional requirements for this course:

Participation will be measured. Attendance in the lab sessions is highly recommended in order to attain the programming skills and background required for the assignments.

Assessment

Item and weight	Details
Final grade
Participation
Programming Assignment 1
Programming Assignment 2
Paper Summary
Group project

The final grade is the sum of the points across all items, divided by 10.
Late policy: 2 points are deducted for every day an assignment is late.
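The grading arithmetic can be sketched as follows. The point values in the example are hypothetical, since the weights per item are not listed here; the sketch also assumes that the late deduction is applied to the assignment that was handed in late and that an item cannot drop below 0 points.

```python
def final_grade(points, days_late=None):
    """Sum the points of all items, apply late deductions, and divide by 10.

    points:    mapping from item name to points earned (hypothetical values)
    days_late: mapping from item name to number of days late (default: none)
    """
    days_late = days_late or {}
    total = 0
    for item, earned in points.items():
        penalty = 2 * days_late.get(item, 0)  # 2 points per day late
        total += max(earned - penalty, 0)     # assumption: no negative points
    return total / 10


# Hypothetical points, for illustration only.
example_points = {
    "Participation": 10,
    "Programming Assignment 1": 18,
    "Programming Assignment 2": 16,
    "Paper Summary": 12,
    "Group project": 25,
}
grade = final_grade(example_points, {"Programming Assignment 2": 2})
print(grade)
```

In this example the items sum to 81 points; handing in Programming Assignment 2 two days late deducts 4 points, leaving 77 points and a grade of 7.7.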

Fraud and plagiarism

The 'Regulations governing fraud and plagiarism for UvA students' apply to this course. Compliance will be monitored carefully. Upon suspicion of fraud or plagiarism, the Examinations Board of the programme will be informed. For the 'Regulations governing fraud and plagiarism for UvA students' see: www.student.uva.nl

Course structure

Week	Topics
1	Foundations of Scalable Data Processing
2	Abstractions for Massively Parallel Data Processing
3	Machine Learning on Distributed Dataflow Systems
4	Distributed Databases
5	Data Validation & Data Cleaning
6	Deep Learning Systems
7	Today's and Tomorrow's Challenges in Big Data Management
8	Final Presentations

Timetable

The schedule for this course is published on DataNose.

Additional information

The course will be taught in English.

Basic knowledge of computing systems and machine learning, as well as basic programming skills, is required.

Contact information

Coordinator

dr. ing. S. Schelter

Owner	Master Information Studies
Coordinator	dr. ing. S. Schelter
Part of	Master Information Studies, track Human Centered Multimedia, year 1; Master Information Studies, track Data Science, year 1; Master Information Studies, track Information Systems, year 1; Master Forensic Science, year 2