Course manual 2022/2023

Course content

Data science is a dynamic and fast-growing interdisciplinary research field that, across science, industry, and government, is altering how people understand the world and make decisions. Not surprisingly, the demand for data science skills is on the rise. This course will cover key principles and tools of data science. In particular, the course will cover the process of acquiring and transforming data; the application of algorithms to learn from data (e.g., classification, regression, clustering); the analysis of network data; and the application of techniques to make decisions based on data. The course will also cover the social and ethical implications of data science, with a particular emphasis on algorithmic fairness and explainability. The course will expose students to theory (i.e., machine learning and statistical methods underlying data science) and practice (i.e., use of data science libraries and analysis of real-world datasets). During the course, students will work on a series of individual exercises and group assignments that will bind together all elements of the data science process. Python will be used for all programming assignments. The course will introduce and make use of Jupyter notebooks, Numpy, Matplotlib, Pandas, Scikit-learn and auxiliary libraries (e.g., GeoPandas, NetworkX, Fairlearn, Seaborn).

Study materials

Literature

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning, 2nd edition, Springer (link)
VanderPlas, J. (2016), Python Data Science Handbook: Essential tools for working with data. O'Reilly Media (link)

Software

Python 3.x
Anaconda Individual Edition (link)
Numpy, Matplotlib, Pandas, Scikit-learn and auxiliary libraries (e.g., GeoPandas, NetworkX, Fairlearn, Seaborn)

Objectives

Explain the main stages and common challenges in data science projects.
Apply (Python) libraries to clean, transform and analyze real-world datasets in an efficient way.
Recognize good data visualization practices and develop effective plots and figures.
Describe the mathematical and computational techniques underlying basic supervised and unsupervised learning algorithms, as well as network data analysis.
Implement statistical and machine learning methods to recognize patterns in data and judge the settings where each method should be applied.
Design a data science project to perform exploratory data analysis and/or answer a research question.
Be familiar with the ethical concerns associated with data science and reason about algorithmic fairness, privacy and explainability.
Describe how to use data science to make consequential decisions and understand the challenges of optimal decision-making in dynamic environments.

Teaching methods

Lecture
Self-study
Computer lab session/practical training
Working independently on e.g. a project or thesis

Each week, students are encouraged to work independently on the lab exercises and the suggested readings (Self-study). The plenary lectures (Hoorcollege) will discuss the theoretical aspects of data science that tie together the week's material, and will host invited guest speakers. The labs (Werkcollege) will be used to practice concepts learned in the lectures. Students are advised to read and work independently before each lab, and to use the lab to discuss, ask questions, and check their solutions. In the lab exercises, students will work with examples of data science applications implemented in Python. Students will form groups of four to work on two assignments. We endeavor to make practical material available the week before it is used (at the latest, the Friday before the respective classes). All Hoorcollege and Werkcollege sessions will be in person, at Science Park.

Learning activities

Activity	Number of hours
Hoorcollege	28
Project	60
Werkcollege	14
Self-study	66

Attendance

TER part B of this programme does not mention any attendance requirements.

Assessment

Item and weight	Details
Final grade
0.35 (35%) Exam
0.5 (50%) Assignments
0.15 (15%) Online quizzes

Assessment will consist of:

2 Group Assignments (25% each)
3 Online quizzes (5% each)
1 Exam (35%)

Any changes will be communicated through Canvas.
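
The weighting above amounts to a weighted average of the six graded components. The following is a minimal sketch of that arithmetic, not an official grading tool. The weights come from the breakdown above. The component names, the example grades, and the assumption that each component is graded on the same scale are hypothetical, and any rounding or minimum-grade rules are not covered here.

```python
# Weights taken from the assessment breakdown in this manual.
# The component names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "assignment_1": 0.25,
    "assignment_2": 0.25,
    "quiz_1": 0.05,
    "quiz_2": 0.05,
    "quiz_3": 0.05,
    "exam": 0.35,
}


def final_grade(grades):
    """Return the weighted average of the component grades.

    Assumes every component is graded on the same scale
    (an assumption; the manual does not state the scale).
    """
    missing = set(WEIGHTS) - set(grades)
    if missing:
        raise ValueError(f"missing grades for: {sorted(missing)}")
    return sum(WEIGHTS[name] * grades[name] for name in WEIGHTS)


# Hypothetical grades, for illustration only.
example_grades = {
    "assignment_1": 8.0,
    "assignment_2": 7.0,
    "quiz_1": 9.0,
    "quiz_2": 6.0,
    "quiz_3": 7.0,
    "exam": 6.5,
}

print(f"Weighted final grade: {final_grade(example_grades):.3f}")
```

Because the exam and the two assignments together carry 85% of the weight, a single quiz moves the result by at most 5% of its own grade range.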

Assignments

Group Assignment 1

With group members: analyse the corpus of UN General Debate statements from 1970 to 2021 and relate it to external datasets (25% of overall grade).

Group Assignment 2

With group members: analyse several ethical aspects of a case chosen by the group and write a paper about it. Use the Networked Systems Ethics guidelines to guide the analysis (25% of overall grade).

Fraud and plagiarism

The 'Regulations governing fraud and plagiarism for UvA students' apply to this course. Compliance will be monitored carefully. Upon suspicion of fraud or plagiarism, the Examinations Board of the programme will be informed. For the 'Regulations governing fraud and plagiarism for UvA students' see: www.student.uva.nl

Course structure

Course structure and study materials *

Week	Lectures (Hoorcollege 1 and 2)	Werkcollege	Assessment	Weekly readings
5/9	Introduction, Course Overview; Python basics; NumPy; The Data Science life cycle; Exploratory Data Analysis; Pandas	Lab 1: Exercises with Python and NumPy	Quiz 0 (0%), Sep 9	Ch 2 and 3 of [2]. Suggested: [3] [4] [5]
12/9	Visualization best practices; Data visualization with Matplotlib; Elements of Statistical Learning; Visual Analytics (by Marcel Worring)	Lab 2: Exercises with Pandas and Matplotlib; Analysis of the 2022 Happiness Report dataset	Quiz 1 (5%), Sep 16	Ch 4 and 5 of [2]. Suggested: [6] [7] [8]
19/9	Regression: Linear and Polynomial Regression; Feature Engineering; Model Validation; Bias-variance tradeoff; Regularization; Classification: Logistic Regression, Naïve Bayes, k-NN; Gradient Descent; Generative/Discriminative, Parametric/non-parametric models	Lab 3: Regression and classification; Intro to group assignment: Analysis of UN Debates dataset; Network analysis	Quiz 2 (5%), Sep 23	Ch 2, 3, 4, 5.1, 8.1 of [1]. Ch 5 of [2]
26/9	Decision Trees and Ensemble Methods; Unsupervised Learning: Principal Component Analysis, K-Means; Network data: Principles of network science and graph analysis; Unsupervised Learning: Community Detection	Support to finish group assignment	Assignment I (25%): suggested deadline 30/9, hard deadline 3/10 23:59	Ch 4.4, 8.2 and 12 of [1]. Ch 5 of [2]. Suggested: [9] [10]
3/10	Ethics & Data Science I (by Arjan Vreeken); Ethics & Data Science II (by Arjan Vreeken)	Lab 4: Compute fairness metrics in concrete datasets; Fairlearn	Quiz 3 (5%), Oct 7	Suggested: [11]
10/10	Fairness metrics; Techniques to mitigate bias in data science; Explanation techniques; Differential Privacy; Data Science in Dynamic Environments: Strategic Classification	Support to finish group assignment	Assignment II - Ethics (25%): suggested deadline 14/10, hard deadline 17/10 23:59	Suggested: [12] [13] [14] [15]
17/10	Data Science in the field (by Max Baak); Frontiers of Data Science; Connections with next courses; Exam preparation	Discussion of assignments; Exercises exam preparation		Suggested: Ch 10 of [1]
24/10	Exam (25/10)		Exam (35%)	

* Check Canvas for updates

Primary bibliography

[1] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning, 2nd edition, Springer (link)

[2] VanderPlas, J. (2016), Python Data Science Handbook: Essential tools for working with data. O'Reilly Media (link)

Suggested readings

[3] Blei, D. M., & Smyth, P. (2017). Science and data science. Proceedings of the National Academy of Sciences, 114(33), 8689-8692. (link)

[4] Chapter 1: O'Neil, C., & Schutt, R. (2013). Doing data science: Straight talk from the frontline. O'Reilly Media, Inc. (link)

[5] Leek, Jeffery T., and Roger D. Peng. What is the question?, Science 347.6228 (2015): 1314-1315 (link)

[6] Rougier, Nicolas P., Michael Droettboom, and Philip E. Bourne. "Ten simple rules for better figures." PLoS Computational Biology 10.9 (2014): e1003833 (link)

[7] A. Endert et al.: The state of the art in integrating machine learning into visual analytics. Computer Graphics Forum, 36 (4) (link)

[8] M. Worring et al.: Multimedia pivot tables for multimedia analytics on image collections. IEEE TMM, 18 (11), pp. 2217 – 2227 (link)

[9] Vespignani, A. Twenty years of network science. Nature (2018): 528-529 (link)

[10] Newman, M. The structure and function of complex networks. SIAM Review 45.2 (2003): 167-256 (Sections 1 to 3, 8.2) (link)

[11] Barocas, S., & Boyd, D. (2017). Engaging the ethics of data science in practice. Communications of the ACM, 60(11), 23-25 (link)

[12] Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., & Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences, 116(44), 22071-22080. (link)

[13] Chouldechova, A., & Roth, A. (2020). A snapshot of the frontiers of fairness in machine learning. Communications of the ACM, 63(5), 82-89. (link)

[14] Dwork, Cynthia, and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9.3–4 (2014): 211-407 (link)

[15] Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., & Hardt, M. (2018, July). Delayed impact of fair machine learning. In International Conference on Machine Learning (pp. 3150-3158). PMLR. (link)

Recommended reading in case students need a Python refresher

VanderPlas, J. (2016), A Whirlwind Tour of Python. O'Reilly Media (link)

Timetable

The schedule for this course is published on DataNose.

Contact information

Coordinator

Fernando Pascoal Dos Santos

Staff and guest lecturers

S.E. Altamirano Ortega
M.A. Baak
dr. Reshmi Gopalakrishna Pillai PhD
João Lebre Magalhães Pereira
Ilse van der Linden MSc
Sara Mahdavi Hezavehi
Dimitris Michailidis
Lois Rink
M. Tasnim MSc
A. Toshniwal MSc
drs. Arjan Vreeken
prof. dr. Marcel Worring

Owner	Master Information Studies
Coordinator	Fernando Pascoal Dos Santos
Part of	Master Information Studies, track Data Science, year 1

Fundamentals of Data Science