Fundamentals of Data Science

6 EC

Semester 1, period 1

5294FUDS6Y

Owner: Master Information Studies
Coordinator: Fernando Pascoal Dos Santos
Part of: Master Information Studies, track Data Science, year 1; Master Forensic Science, year 2

Course manual 2024/2025

Course content

Data science is a dynamic and fast-growing interdisciplinary research field that is altering how people across science, industry, and government understand the world and make decisions. Not surprisingly, the demand for data science skills is on the rise. This course covers key principles and tools of data science: the process of acquiring, transforming and visualizing data; the application of algorithms to learn from data (e.g., classification, regression, clustering); and the application of techniques to make decisions based on data. It also covers network data analysis, the social and ethical implications of data science (with emphasis on algorithmic fairness, privacy and explainability) and the application of recent generative AI tools in the data science life cycle. The course exposes students to theory (i.e., machine learning and statistical methods) and practice (i.e., use of data science libraries and analysis of real-world datasets). During the course, students work on a series of individual exercises and group assignments that bind together all elements of the data science process. Python is used for all programming assignments and projects. The course introduces and makes use of Jupyter notebooks, NumPy, Matplotlib and Pandas. Auxiliary libraries such as NetworkX, GeoPandas and Seaborn are also covered.

Study materials

Literature

  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning, 2nd edition, Springer (link)

  • VanderPlas, J. (2016), Python Data Science Handbook: Essential tools for working with data. O'Reilly Media (link)

  • Scientific articles shared via Canvas (See Course Structure for more info)

Software

  • Python 3.x

  • Anaconda Individual Edition (link)

  • NumPy, Matplotlib, Pandas, Scikit-learn and auxiliary libraries (e.g., GeoPandas, NetworkX, Fairlearn, Seaborn)

Objectives

  • Explain the main stages and common challenges in data science projects.
  • Apply (Python) libraries to clean, transform and analyse real-world datasets in an efficient way.
  • Recognise good data visualisation practices and develop effective plots and figures.
  • Describe the mathematical and computational techniques underlying basic supervised and unsupervised learning algorithms, as well as network data analysis and generative modelling.
  • Implement statistical and machine learning methods to recognize patterns in data and judge the settings where each method should be applied.
  • Design a data science project to perform exploratory data analysis and/or answer a research question.
  • Be familiar with the ethical concerns associated with data science and reason about algorithmic fairness, privacy and explainability.
  • Describe how to use data science to perform consequential decisions and understand the challenges of optimal decision-making in dynamic environments.
  • Apply recent generative AI tools to aid data visualization, augmentation and preparation.

Teaching methods

  • Lecture
  • Self-study
  • Computer lab session/practical training
  • Working independently on e.g. a project or thesis

Each week, students are encouraged to work independently on the lab exercises and the suggested readings (self-study). The plenary lectures (hoorcollege) discuss the theoretical aspects of data science that bind together what students are expected to learn during the week, and are also used to interact with invited guest speakers. The labs (werkcollege) are used to practice concepts learned in the lectures. Students are advised to read and work independently before each lab, and to use the lab to discuss, ask questions and confirm their solutions. In the lab exercises, students work with examples of data science applications implemented in Python. Students form groups of 4 people to work on two assignments. We endeavor to make practical material available before it is used (on the Friday before the respective classes). All hoorcollege and werkcollege sessions are in person, at Science Park.

Learning activities

Number of hours per activity:

  • Hoorcollege (lecture): 26
  • Project: 60
  • Werkcollege (lab session): 14
  • Zelfstudie (self-study): 60

Attendance

In TER part B of this programme no requirements regarding attendance are mentioned.

Assessment

Final grade (item, weight, details):

  • Exam: 0.35 (35%), mandatory
  • Assignment 1: 0.25 (25%), mandatory
  • Assignment 2: 0.25 (25%), mandatory
  • Quiz 1: 0.05 (5%)
  • Quiz 2: 0.05 (5%)
  • Quiz 3: 0.05 (5%)

Assessment will consist of

  • 2 Group Assignments (25% each)
  • 3 Online quizzes (5% each)
  • 1 Exam (35%)

Any changes will be communicated through Canvas.
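
The weights above combine into the final grade as a weighted sum. The following is a minimal sketch of that arithmetic, not an official grading tool: it assumes all components are graded on the same scale, and it ignores rounding and any minimum-grade rules, which this manual does not specify. The function name and the example grades are hypothetical.

```python
# Weights taken from the assessment overview in this manual.
WEIGHTS = {
    "exam": 0.35,
    "assignment_1": 0.25,
    "assignment_2": 0.25,
    "quiz_1": 0.05,
    "quiz_2": 0.05,
    "quiz_3": 0.05,
}


def weighted_final_grade(grades):
    """Return the weighted sum of component grades (hypothetical helper).

    Assumes every component uses the same grading scale. Rounding and
    minimum-grade rules are not modelled.
    """
    missing = set(WEIGHTS) - set(grades)
    if missing:
        raise ValueError(f"missing grades for: {sorted(missing)}")
    return sum(WEIGHTS[name] * grades[name] for name in WEIGHTS)


# Made-up grades on a 1-10 scale, for illustration only.
example = {
    "exam": 7.0,
    "assignment_1": 8.0,
    "assignment_2": 6.0,
    "quiz_1": 10.0,
    "quiz_2": 8.0,
    "quiz_3": 6.0,
}
print(round(weighted_final_grade(example), 2))
```

With these made-up grades, the exam contributes 2.45 points, the two assignments 3.5 points together, and the three quizzes 1.2 points together.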

Assignments

Group Assignment 1

  • With group members: analyse the corpus of UN General Debate statements from 1970 to 2024 and relate it to external datasets (25% of overall grade)

Group Assignment 2

  • With group members: develop and evaluate a link prediction model; discuss ethical aspects of the general problem you tried to solve, as well as of your particular solution (25% of overall grade)

Fraud and plagiarism

The 'Regulations governing fraud and plagiarism for UvA students' applies to this course. This will be monitored carefully. Upon suspicion of fraud or plagiarism the Examinations Board of the programme will be informed. For the 'Regulations governing fraud and plagiarism for UvA students' see: www.student.uva.nl

Course structure

Course structure and study materials *

 

Part I: Elements of Data Science

Week 1 (2/9)

  • Hoorcollege 1: Introduction, Course Overview; Python basics; NumPy
  • Hoorcollege 2: The Data Science life cycle; Exploratory Data Analysis; Pandas
  • Werkcollege: Lab 1: Exercises with Python and NumPy
  • Assessment: Quiz 0 (0%), Sep 6, 13:00-19:00, 15 min
  • Weekly readings: Ch 2 and 3 of [2]. Suggested: [3] [4] [5]

Week 2 (9/9)

  • Hoorcollege 1: Visualization best practices; Data visualization with Matplotlib
  • Hoorcollege 2: Regression: Linear and Polynomial Regression; Feature Engineering; Bias-variance tradeoff; Regularization
  • Werkcollege: Lab 2: Exercises with Pandas and Matplotlib
  • Assessment: Quiz 1 (5%), Sep 13, 13:00-19:00, 15 min
  • Weekly readings: Ch 4 and 5 of [2]

Week 3 (16/9)

  • Hoorcollege 1: Classification: Logistic Regression, Naïve Bayes, k-NN; Gradient Descent; Generative/Discriminative, Parametric/non-parametric models
  • Hoorcollege 2: Decision Trees and Ensemble Methods; Model Validation; Network science and graph analysis: Link prediction
  • Werkcollege: Lab 3: Regression and classification; Intro to group assignment
  • Assessment: Quiz 2 (5%), Sep 20, 13:00-19:00, 15 min
  • Weekly readings: Ch 2, 3, 4, 5.1, 8.1 of [1]; Ch 5 of [2]. Suggested: [6]

Week 4 (23/9)

  • Hoorcollege 1 (Monday, Sep 23): Introduction to Unsupervised Learning: K-Means and PCA; Community Detection
  • Hoorcollege 2: Assignment Q&A (online)
  • Werkcollege: Support to finish group assignment
  • Assessment: Assignment I (25%), suggested deadline 27/9, hard deadline 30/9 23:59
  • Weekly readings: Ch 4.4, 8.2 and 12 of [1]; Ch 5 of [2]. Suggested: [7] [8]

Part II: Data Science in the Real World

Week 5 (30/9)

  • Hoorcollege 1: Ethics and Data Science: Introduction & Fairness (by Arjan Vreeken)
  • Hoorcollege 2: Ethics and Data Science: Privacy & Transparency (by Arjan Vreeken)
  • Werkcollege: Lab 4: Compute fairness metrics in concrete datasets; Fairlearn; Network analysis
  • Assessment: Quiz 3 (5%), Oct 4, 13:00-19:00, 15 min
  • Weekly readings: Suggested: [9]

Week 6 (7/10)

  • Hoorcollege 1: Fairness metrics and techniques to mitigate bias in data science; Transparency in Data Science
  • Hoorcollege 2: Differential Privacy; Data Science in Dynamic Environments
  • Werkcollege: Support to finish group assignment
  • Assessment: Assignment II (25%), suggested deadline 11/10, hard deadline 14/10 23:59
  • Weekly readings: Suggested: [10] [11] [12]

Week 7 (14/10)

  • Hoorcollege 1 (Monday, Oct 14): Invited Lecture (by Marcel Worring); Generative AI in the Data Science life cycle
  • Hoorcollege 2: Wrap-up and connection with next courses; Thesis tips (by Lester van der Pluijm); Exam preparation
  • Werkcollege: Discussion of assignments; Exam preparation
  • Assessment: Exam preparation
  • Weekly readings: -

Exam

Week 8 (21/10)

  • Exam (23/10)
  • Assessment: Exam (35%)
  • Check DataNose

 

* Check Canvas for updates

 

Primary bibliography 

[1] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning, 2nd edition, Springer (link)

[2] VanderPlas, J. (2016), Python Data Science Handbook: Essential tools for working with data. O'Reilly Media (link)

 

Suggested readings

[3] Blei, D. M., & Smyth, P. (2017). Science and data science. Proceedings of the National Academy of Sciences, 114(33), 8689-8692. (link)

[4] Chapter 1: O'Neil, C., & Schutt, R. (2013). Doing data science: Straight talk from the frontline. O'Reilly Media, Inc. (link)

[5] Leek, J. T., & Peng, R. D. (2015). What is the question? Science, 347(6228), 1314-1315. (link)

[6] Rougier, N. P., Droettboom, M., & Bourne, P. E. (2014). Ten simple rules for better figures. PLoS Computational Biology, 10(9), e1003833. (link)

[7] Newman, M. (2003). The structure and function of complex networks. SIAM Review, 45(2), 167-256 (Sec. I, II, III). (link)

[8] Liben-Nowell, D., & Kleinberg, J. (2003). The link prediction problem for social networks. In Proceedings of CIKM 2003 (link)

[9] Barocas, S., & Boyd, D. (2017). Engaging the ethics of data science in practice. Communications of the ACM, 60(11), 23-25 (link)

[10] Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., & Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences, 116(44), 22071-22080. (link)

[11] Chouldechova, A., & Roth, A. (2020). A snapshot of the frontiers of fairness in machine learning. Communications of the ACM, 63(5), 82-89. (link)

[12] Dwork, C., & Roth, A. (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4), 211-407. (link)

Recommended reading in case students need a Python refresher

VanderPlas, J. (2016), A Whirlwind Tour of Python. O'Reilly Media  (link)

Contact information

Coordinator

  • Fernando Pascoal Dos Santos

Staff and guest lecturers 

  • dr. Ali Alsahag PhD
  • J.N. Bastos Fonseca MSc
  • Alexandre da Silva Pires
  • dr. S.S. Mohammadi Ziabari PhD
  • Y. Ma
  • A. Mukundan
  • Madhura Pawar
  • Martin Smit
  • drs. Arjan Vreeken
  • prof. dr. Marcel Worring