Course manual 2020/2021

Course content

This course will provide students with a general understanding of data-related and systems-related challenges in Big Data applications. They will gain fundamental knowledge about principled approaches to tackle such challenges, with respect to systems abstractions, programming models and execution models for parallel and distributed data-intensive applications.

The course prepares students for data-related tasks in a job as a Data Engineer, ML Engineer, Applied Scientist or Researcher, and puts a focus on implementation skills, for example by walking students through low-level MapReduce jobs for several data-related problems and common data processing operators. At the same time, the course highlights ongoing research problems in the area of Big Data processing. Furthermore, the course details the history of many systems currently at the forefront of computing; for example, it discusses the roots of Google TensorFlow in previous Big Data systems at Google.

The course will also feature guest speakers from leading companies to connect students to real-world problems.

Study materials

Literature

  • Scientific papers.

Practical training material

  • Programming exercises

Software

  • Apache Hadoop, Apache Maven

Objectives

  • Develop a general understanding of data-related and systems-related challenges in Big Data applications.
  • Gain fundamental knowledge about principled approaches to tackle Big Data challenges, e.g., systems abstractions, programming models and execution models.
  • Practice distributed data processing by implementing low-level MapReduce jobs.
  • Read, understand, and reflect upon a recent research paper in data management.
  • Work in project teams with people from a variety of backgrounds towards solving an end-to-end Big Data problem.

Teaching methods

  • Lecture
  • Self-study
  • Computer lab session/practical training
  • Seminar
  • Presentation/symposium
  • Working independently on, e.g., a project or thesis

Learning activities

Activity Hours
Lecture 14
Computer lab session 14
Presentation 6
Seminar 14
Self-study 120
Total 168 (6 EC x 28 hours)

Attendance

In TER part B of this programme no requirements regarding attendance are mentioned.

Additional requirements for this course:

Participation will be measured. Attendance in the lab sessions is highly recommended in order to attain the programming skills and background required for the assignments.

Assessment

Item and weight Details
Final grade
Participation
Programming Assignment 1
Programming Assignment 2
Paper Summary
Group project

  • The final grade is the sum of the points across all items, divided by 10.
  • Late policy: 2 points are deducted for every day an assignment is late.
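
As a rough illustration of how the two rules above combine, the following Python sketch computes a final grade. The point values in the example are hypothetical and are not the actual weights of this course, which are not listed here. The sketch also assumes that the late deduction is taken from the points of the late item itself and that an item cannot drop below zero points; the manual does not specify either detail.

```python
def final_grade(points, days_late=None, penalty_per_day=2):
    """Sum the points of all items, apply late deductions, divide by 10."""
    days_late = days_late or {}
    total = 0
    for item, item_points in points.items():
        # Assumption: the deduction applies to the late item's own points,
        # and an item cannot go below zero points.
        deduction = penalty_per_day * days_late.get(item, 0)
        total += max(item_points - deduction, 0)
    return total / 10


# Hypothetical point values, chosen only for illustration.
example_points = {
    "Participation": 10,
    "Programming Assignment 1": 20,
    "Programming Assignment 2": 20,
    "Paper Summary": 15,
    "Group project": 25,
}

# Programming Assignment 1 handed in one day late: 2 points are deducted.
print(final_grade(example_points, days_late={"Programming Assignment 1": 1}))
```

With these made-up numbers, handing in one assignment one day late lowers the sum from 90 to 88 points, so the grade drops from 9.0 to 8.8.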

Fraud and plagiarism

The 'Regulations governing fraud and plagiarism for UvA students' applies to this course. This will be monitored carefully. Upon suspicion of fraud or plagiarism the Examinations Board of the programme will be informed. For the 'Regulations governing fraud and plagiarism for UvA students' see: www.student.uva.nl

Course structure

Week number Topics
1 Foundations of Scalable Data Processing
2 Abstractions for Massively Parallel Data Processing
3 Machine Learning on Distributed Dataflow Systems
4 Distributed Databases
5 Data Validation & Data Cleaning
6 Deep Learning Systems
7 Today's and Tomorrow's Challenges in Big Data Management
8 Final Presentations

Timetable

The schedule for this course is published on DataNose.

Additional information

The course will be taught in English. 

Basic knowledge of computing systems and machine learning, as well as basic programming skills, is required.

Contact information

Coordinator

  • dr. ing. S. Schelter