6 EC
Semester 1, period 1
5294FUDS6Y
| Owner | Master Information Studies |
| Coordinator | Fernando Pascoal Dos Santos |
| Part of | Master Information Studies, track Data Science, year 1; Master Forensic Science, year 2 |
Data science is a dynamic and fast-growing interdisciplinary research field that is altering how people across science, industry, and government understand the world and make decisions. Not surprisingly, the demand for data science skills is on the rise.
This course covers key principles and tools of data science: the process of acquiring and transforming data; the application of algorithms to learn from data (e.g., classification, regression, clustering); and the application of techniques to make decisions based on data. It also covers network data analysis and the social and ethical implications of data science, with particular emphasis on algorithmic fairness, privacy and explainability.
The course exposes students to theory (i.e., the machine learning and statistical methods underlying data science) and practice (i.e., the use of data science libraries and the analysis of real-world datasets). During the course, students work on a series of individual exercises and group assignments that bind together all elements of the data science process. Python is used for all programming assignments and projects. The course introduces and makes use of Jupyter notebooks, NumPy, Matplotlib and Pandas; auxiliary libraries such as NetworkX, GeoPandas and Seaborn are also covered.
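The data science process named in the course description (acquire, transform, learn, decide) can be illustrated with a minimal sketch. This is not course material: the toy data, the column names and the helper functions `acquire`, `transform` and `predict` are hypothetical, and the example deliberately uses only the Python standard library, whereas the course itself works with libraries such as Pandas and Scikit-learn.

```python
import csv
import io
from statistics import mean, stdev

# Hypothetical toy data, invented for this illustration only.
RAW_CSV = """hours_studied,passed
1.0,0
2.0,0
3.0,0
6.0,1
7.0,1
8.0,1
"""


def acquire(text):
    """Acquire: parse CSV text into (feature, label) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(float(r["hours_studied"]), int(r["passed"])) for r in reader]


def transform(rows):
    """Transform: standardize the feature to zero mean and unit variance."""
    xs = [x for x, _ in rows]
    mu, sigma = mean(xs), stdev(xs)
    scaled = [((x - mu) / sigma, y) for x, y in rows]
    return scaled, (mu, sigma)


def predict(train, x, k=3):
    """Learn and decide: majority vote among the k nearest neighbours."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


rows = acquire(RAW_CSV)
train, (mu, sigma) = transform(rows)

# New observations must be scaled with the statistics of the training data.
new_student = (6.5 - mu) / sigma
print(predict(train, new_student))
```

The classifier here is k-nearest neighbours, one of the classification methods listed in the course schedule; any of the other methods could be substituted in the `predict` step without changing the overall flow.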
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning, 2nd edition. Springer.
VanderPlas, J. (2016). Python Data Science Handbook: Essential tools for working with data. O'Reilly Media.
Scientific articles shared via Canvas (See Course Structure for more info)
Python 3.x
Anaconda Individual Edition
Numpy, Matplotlib, Pandas, Scikit-learn and auxiliary libraries (e.g., GeoPandas, NetworkX, Fairlearn, Seaborn)
Each week, students are encouraged to work independently on the lab exercises and the suggested readings (self-study). The plenary lectures (Hoorcollege) discuss the theoretical content that binds together what students are expected to learn during the week, and include sessions with invited guest speakers. The labs (Werkcollege) are used to practice the concepts covered in the lectures. Students are advised to read and work independently before each lab, and to use the lab to discuss, ask questions and confirm their solutions. In the lab exercises, students work with examples of data science applications implemented in Python. Students form groups of four to work on two assignments. We endeavor to make practical material available before it is used (on the Friday before the respective classes). All lectures and labs are in person, at Science Park.
| Activity | Number of hours |
| Lecture (Hoorcollege) | 28 |
| Project | 60 |
| Lab (Werkcollege) | 14 |
| Self-study (Zelfstudie) | 66 |
In TER part B of this programme no requirements regarding attendance are mentioned.
| Item and weight | Details |
| Final grade | |
| 1 (100%) Exam (Tentamen) | |
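Assessment will consist of three graded quizzes (5% each), two group assignments (25% each) and a final exam (35%), following the weights listed in the course structure table.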
Any changes will be communicated through Canvas.
With group members: analyse the corpus of UN General Debate statements from 1970 to 2023 and relate it to external datasets (25% of overall grade)
With group members: develop and evaluate a link prediction model; discuss the ethical aspects of the general problem you tried to solve, as well as of your particular solution (25% of overall grade)
The 'Regulations governing fraud and plagiarism for UvA students' applies to this course. This will be monitored carefully. Upon suspicion of fraud or plagiarism the Examinations Board of the programme will be informed. For the 'Regulations governing fraud and plagiarism for UvA students' see: www.student.uva.nl
Course structure and study materials *
| Week | Lecture 1 (Hoorcollege) | Lecture 2 (Hoorcollege) | Lab (Werkcollege) | Assessment | Weekly Readings |
| Part I: Elements of Data Science | | | | | |
| 1 (4/9) | Introduction, Course Overview; Python basics; NumPy | The Data Science life cycle; Exploratory Data Analysis; Pandas | Lab 1: Exercises with Python and NumPy | Quiz 0 (0%), Sep 8, 13:00-19:00, 15 min | Ch 2 and 3 of [2]; Suggested: [3] [4] [5] |
| 2 (11/9) | Visualization best practices; Data visualization with Matplotlib; Elements of Statistical Learning | Visual Analytics (by Marcel Worring) | Lab 2: Exercises with Pandas and Matplotlib | Quiz 1 (5%), Sep 15, 13:00-19:00, 15 min | Ch 4 and 5 of [2]; Suggested: [6] [7] [8] |
| 3 (18/9) | Regression: Linear and Polynomial Regression; Feature Engineering; Bias-variance tradeoff; Regularization | Classification: Logistic Regression, Naïve Bayes, k-NN; Gradient Descent; Generative/Discriminative, Parametric/non-parametric models | Lab 3: Regression and classification; Intro to group assignment | Quiz 2 (5%), Sep 22, 13:00-19:00, 15 min | Ch 2, 3, 4, 5.1, 8.1 of [1]; Ch 5 of [2] |
| 4 (25/9) | Decision Trees and Ensemble Methods; Model Validation (Classification); Introduction to Unsupervised Learning: K-Means and PCA; Assignment Q&A | Network science and graph analysis: Link prediction (Supervised Learning), Community Detection (Unsupervised Learning); TA Research Talk | Support to finish group assignment | Assignment I (25%), suggested deadline 29/9, hard deadline 2/10 23:59 | Ch 4.4, 8.2 and 12 of [1]; Ch 5 of [2]; Suggested: [9] [10] [11] |
| Part II: Data Science in the Real World | | | | | |
| 5 (2/10) | Ethics & Data Science; Case study discussion (by Arjan Vreeken) | Ethics & Data Science; Background (by Arjan Vreeken) | Lab 4: Compute fairness metrics in concrete datasets; Network analysis | Quiz 3 (5%), Oct 6, 13:00-19:00, 15 min | Suggested: [12] |
| 6 (9/10) | Fairness metrics and techniques to mitigate bias in data science; Transparency in DS | Differential Privacy; Data Science in Dynamic Environments; TA Research Talk | Support to finish group assignment | Assignment II (25%), suggested deadline 13/10, hard deadline 16/10 23:59 | Suggested: [13] [14] [15] |
| 7 (16/10) | Data Science in the field (by David Graus) | Wrap-up; Connections with next courses; Exam preparation | Discussion of assignments; Exam preparation | Exam preparation | - |
| Exam | | | | | |
| 8 (23/10) | Exam | | | Exam (35%); check DataNose | |
* Check Canvas for updates
Primary bibliography
[1] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning, 2nd edition. Springer.
[2] VanderPlas, J. (2016). Python Data Science Handbook: Essential tools for working with data. O'Reilly Media.
Suggested readings
[3] Blei, D. M., & Smyth, P. (2017). Science and data science. Proceedings of the National Academy of Sciences, 114(33), 8689-8692.
[4] O'Neil, C., & Schutt, R. (2013). Doing data science: Straight talk from the frontline. O'Reilly Media. Chapter 1.
[5] Leek, J. T., & Peng, R. D. (2015). What is the question? Science, 347(6228), 1314-1315.
[6] Rougier, N. P., Droettboom, M., & Bourne, P. E. (2014). Ten simple rules for better figures. PLoS Computational Biology, 10(9), e1003833.
[7] Endert, A., et al. The state of the art in integrating machine learning into visual analytics. Computer Graphics Forum, 36(4).
[8] Worring, M., et al. Multimedia pivot tables for multimedia analytics on image collections. IEEE TMM, 18(11), 2217-2227.
[9] Vespignani, A. (2018). Twenty years of network science. Nature, 528-529.
[10] Newman, M. (2003). The structure and function of complex networks. SIAM Review, 45(2), 167-256 (Sec. 1 and 2; 3.2).
[11] Zhou, T. (2021). Progresses and challenges in link prediction. iScience, 24(11).
[12] Barocas, S., & Boyd, D. (2017). Engaging the ethics of data science in practice. Communications of the ACM, 60(11), 23-25.
[13] Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., & Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences, 116(44), 22071-22080.
[14] Chouldechova, A., & Roth, A. (2020). A snapshot of the frontiers of fairness in machine learning. Communications of the ACM, 63(5), 82-89.
[15] Dwork, C., & Roth, A. (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4), 211-407.
Recommended reading in case students need a Python refresher
VanderPlas, J. (2016). A Whirlwind Tour of Python. O'Reilly Media.
The schedule for this course is published on DataNose.