Course manual 2021/2022

Course content

Data science is a dynamic and fast-growing interdisciplinary research field that, across science, industry, and government, is altering how people understand the world and make decisions. Not surprisingly, the demand for data science skills is on the rise. This course will cover key principles and tools of data science. In particular, the course will cover the process of acquiring and transforming data; the application of algorithms to learn from data (e.g., classification, regression, clustering); and the application of techniques to make decisions based on data, founded on introductory concepts of game theory. The course will also cover the social and ethical implications of data science, with a particular emphasis on algorithmic fairness and explainability. The course will expose students to theory (i.e., machine learning and statistical methods underlying data science) and practice (i.e., use of data science libraries and analysis of real-world datasets). During the course, students will work on a series of individual exercises and group assignments that will bind together all elements of the data science process. Python will be used for all programming assignments. The course will introduce and make use of Jupyter notebooks, NumPy, Matplotlib, Pandas and Scikit-learn.

Study materials

Literature

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning, Springer (link)
VanderPlas, J. (2016). Python Data Science Handbook: Essential tools for working with data. O'Reilly Media (link)

Software

Python 3.x
Anaconda Individual Edition (link)
Fairlearn (link)

Objectives

Explain the main stages and common challenges in data science projects
Describe the computational and mathematical techniques underlying data science
Compare basic statistical and machine learning methods to recognize patterns in data and identify the settings where each should be applied
Apply data science (Python) libraries to obtain, transform, analyze and visualize real-world datasets
Describe how to use data science to make consequential decisions and understand the challenges of optimal decision-making in dynamic environments
Be familiar with the ethical concerns associated with data science and reason about algorithmic fairness and explainability

Teaching methods

Lecture
Self-study
Computer lab session/practical training
Working independently on e.g. a project or thesis

Each week, students are encouraged to work independently on the lab exercises and the suggested readings (Self-study). The plenary lectures (Hoorcollege) focus on the theoretical content that binds together what students are expected to learn during the week, and are also used to interact with invited guest speakers. Because plenary lectures will take place after some labs (Werkcollege), students are advised to read and work independently before each lab, and to use the lab to discuss, ask questions and check their solutions. In the lab exercises, students will work with examples of data science applications implemented in Python. Students will form groups of 4-5 people to work on two assignments.

Learning activities

Activity	Number of hours
Hoorcollege	28
Project	60
Werkcollege	14
Zelfstudie	66

Attendance

In TER part B of this programme no requirements regarding attendance are mentioned.

Assessment

Item and weight	Details
Final grade

Assessment will consist of

2 Group Assignments (25% each)
3 Online quizzes (5% each)
1 Exam (35%)

Changes (e.g., due to changes in COVID-19 restrictions) may apply and will be communicated through Canvas.

Assignments

Group Assignment 1

With group members: analyse the corpus of UN General Debate statements from 1970 to 2020 and relate it to external datasets (25% of overall grade)

Group Assignment 2

With group members: analyse several ethical aspects of a case chosen by the group and write a paper about it. Use the Networked Systems Ethics guidelines to guide analysis (25% of overall grade)

Fraud and plagiarism

The 'Regulations governing fraud and plagiarism for UvA students' apply to this course. Compliance will be monitored carefully. Upon suspicion of fraud or plagiarism, the Examinations Board of the programme will be informed. For the 'Regulations governing fraud and plagiarism for UvA students' see: www.student.uva.nl

Course structure

(Tentative) schedule and readings *:

	Week	Hoorcollege 1	Hoorcollege 2	Werkcollege	Assessment	Weekly Readings
Part I: Elements of Data Science	1	Introduction, Course Overview; Python basics and NumPy	The Data Science life cycle; Exploratory Data Analysis; Pandas	Exercises with Python and NumPy	-	Ch 1 and 2 of [2]; [3]; [4]
	2	Visual Analytics (by Marcel Worring)	Data visualization with Matplotlib; Elements of statistical learning	Exercises with Pandas and Matplotlib; Analysis of the 2021 Happiness Report dataset (link)	Quiz 1 (5%), time TBD	Ch 3 and 4 of [2]; [5]; [6]; [7]
	3	Regression: Linear, Polynomial, Logistic; Feature Engineering; Model Validation	Classification: Naïve Bayes; Feature engineering: text features; Short intro to NLTK	Regression and classification exercises; Intro to the group assignment: Analysis of UN Debates dataset (link)	Quiz 2 (5%), time TBD	Ch 3, 4 and 8.1 of [1]; Ch 4 of [2]
	4	Ensemble Methods: Decision Trees and Random Forests; Support Vector Machines	Principal Component Analysis; Clustering, K-Means	Support to group assignment	Assignment I (25%), due 4/10 23:59	Ch 8.2, 9 and 10 of [1]; Ch 4 of [2]
Part II: Data Science in the Real World	5	Ethics & Data Science I (by Arjan Vreeken)	Ethics & Data Science II (by Arjan Vreeken)	Compute fairness metrics in concrete datasets; Fairlearn	Quiz 3 (5%), time TBD	[8]
	6	Fairness metrics; Biased data and word embeddings; Explanation techniques	Consequential decision making; Data science in dynamic environments	Support to ethics group assignment	Assignment II - Ethics (25%), due 18/10 23:59	[9]; [10]; [11]; [12]
	7	Data Science in the field and synthetic data generation (by Max Baak)	Wrap up; Connections with next courses; Discussion	Discussion of assignments; Data science with network data (NetworkX)	-	-
	8				Exam (35%), check DataNose

* check Canvas for updates.

Primary bibliography

[1] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning, Springer (link)

[2] VanderPlas, J. (2016). Python Data Science Handbook: Essential tools for working with data. O'Reilly Media (link)

Recommended readings

[3] Blei, D. M., & Smyth, P. (2017). Science and data science. Proceedings of the National Academy of Sciences, 114(33), 8689-8692. (link)

[4] Chapter 1: O'Neil, C., & Schutt, R. (2013). Doing data science: Straight talk from the frontline. O'Reilly Media, Inc. (link)

[5] D. Sacha et al.: Knowledge generation model for visual analytics. IEEE TVCG, 20 (12), pp. 1604-1613, December 2014.

[6] A. Endert et al.: The state of the art in integrating machine learning into visual analytics. Computer Graphics Forum, 36 (4), March 2017.

[7] M. Worring et al.: Multimedia pivot tables for multimedia analytics on image collections. IEEE TMM, 18 (11), pp. 2217-2227, September 2016.

[8] Barocas, S., & Boyd, D. (2017). Engaging the ethics of data science in practice. Communications of the ACM, 60(11), 23-25.

[9] Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., & Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences, 116(44), 22071-22080. (link)

[10] Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August). "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144). (link)

[11] Chouldechova, A., & Roth, A. (2020). A snapshot of the frontiers of fairness in machine learning. Communications of the ACM, 63(5), 82-89. (link)

[12] Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., & Hardt, M. (2018, July). Delayed impact of fair machine learning. In International Conference on Machine Learning (pp. 3150-3158). PMLR. (link)

Recommended reading in case students need a Python refresher

VanderPlas, J. (2016). A Whirlwind Tour of Python. O'Reilly Media (link)

Timetable

The schedule for this course is published on DataNose.

Contact information

Coordinator

Fernando Pascoal Dos Santos

Staff

S.E. Altamirano
Peter Fratric
Reshmi Gopalakrishna Pillai MSc
J.P. Lebre Magalhães Pereira
Ilse van der Linden MSc
Sara Mahdavi Hezavehi
Dimitris Michailidis MSc
dr. ing. C.M. Rodriguez Rivero
T.J. van Sonsbeek
drs. Arjan Vreeken

Owner	Master Information Studies
Coordinator	Fernando Pascoal Dos Santos
Part of	Master Information Studies, track Data Science, year 1

Fundamentals of Data Science