Stefan Grafberger

Ph.D. Student

University of Amsterdam


I am a Ph.D. student at the University of Amsterdam, conducting research at the intersection of data management and machine learning.

I am part of AIRLab Amsterdam, a member of the Intelligent Data Engineering Lab led by Paul Groth, and a member of the Information and Language Processing Systems group led by Maarten de Rijke. My Ph.D. supervisors are Sebastian Schelter and Paul Groth.

In the past, I interned at Amazon Research and Oracle Labs, and worked as a research assistant in the database group at TU Munich. My master’s thesis was supervised by Sebastian Schelter, Julia Stoyanovich, and Alfons Kemper.


Recent Publications

All publications

(2021). HedgeCut: Maintaining Randomized Trees for Low-Latency Machine Unlearning. ACM SIGMOD.


(2021). mlinspect: a Data Distribution Debugger for Machine Learning Pipelines. ACM SIGMOD (demo).


(2020). Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines. Conference on Innovative Data Systems Research (CIDR).


(2019). Differential Data Quality Verification on Partitioned Data. International Conference on Data Engineering (ICDE).


(2018). Deequ: Data Quality Validation for Machine Learning Pipelines. Workshop on Machine Learning Systems at the Conference on Neural Information Processing Systems (NeurIPS).



Before starting my Ph.D. at the University of Amsterdam, I was a student in the Software Engineering Elite Graduate Program offered jointly by TU Munich, LMU Munich, and the University of Augsburg. I received my bachelor’s degree from the University of Augsburg.

During my studies, I interned with Amazon Research and Oracle Labs, worked as a research assistant at TU Munich, interned and worked as a working student at TNG Technology Consulting in Munich, and was a teaching assistant at the University of Augsburg.

In the past, I worked on deequ, a library for ‘unit-testing’ large datasets with Apache Spark. Currently, I work on mlinspect, a project I started during my master’s thesis. Its goal is to automatically find issues such as technical bias in machine learning pipelines written with libraries like pandas and scikit-learn.
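To illustrate the kind of issue meant by technical bias (this is a hand-rolled pandas sketch, not mlinspect’s actual API): a seemingly harmless preprocessing step can silently change the demographic makeup of the training data.

```python
import pandas as pd

# Hypothetical toy data: group B happens to have more missing income values.
data = pd.DataFrame({
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "income": [50,   60,  55,  58, None, None, None, 45],
})

before = data["group"].value_counts(normalize=True)

# A routine cleaning step in the pipeline ...
cleaned = data.dropna(subset=["income"])

after = cleaned["group"].value_counts(normalize=True)

# ... shrinks group B from 50% to 20% of the data, a distribution shift
# that a downstream model would inherit without any visible error.
print(before.round(2).to_dict())  # {'A': 0.5, 'B': 0.5}
print(after.round(2).to_dict())   # {'A': 0.8, 'B': 0.2}
```

A data distribution debugger like mlinspect aims to surface exactly such before/after shifts automatically, without requiring the developer to instrument each step by hand.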


I’m reachable via email at