Stefan Grafberger

Ph.D. Student

BIFOLD & TU Berlin

Biography

I am a Ph.D. student at BIFOLD and TU Berlin in the DEEM Lab, conducting research at the intersection of data management and machine learning. I mainly publish at conferences like SIGMOD and VLDB.

My Ph.D. advisors are Sebastian Schelter and Paul Groth. I work on responsible data management (also in collaboration with Julia Stoyanovich). I spent the first three years of my Ph.D. at the University of Amsterdam in the Intelligent Data Engineering Lab, before Sebastian transitioned to TU Berlin. Before my Ph.D., I did my master's at TU Munich with Thomas Neumann and Alfons Kemper and focused on databases.

During my studies, I interned with Microsoft GSL, Amazon Research, and Oracle Labs, and worked as a research assistant at TU Munich.

News

I will be co-organising the workshop on Data Management for End-to-End Machine Learning (DEEM) at SIGMOD 2025 in Berlin.
I presented an overview of my research during a BIFOLD lunch talk.
After three amazing years in Amsterdam, I followed my advisor Sebastian to TU Berlin to finish my Ph.D. there. Very excited to join the data management community in Berlin!
I was featured in an ICAI blogpost.
I presented an overview of my Ph.D. research at the Microsoft GSL talk series.
I did a research internship in Redmond with the Microsoft Gray Systems Lab this summer.
I presented a poster on Towards Declarative Systems for Data-Centric Machine Learning at DMLR at ICML.
Sebastian, Shubha, and I won an ACM SIGMOD Best Demo Runner-Up Award for our demo on Proactively Screening Machine Learning Pipelines with ArgusEyes.
We open-sourced our prototype StreamDQ, a library built on top of Apache Flink for defining “unit tests for data”, which measure data quality in large data streams.
I gave a talk about Automating and Optimizing Data-Centric What-If Analyses on Native ML Pipelines at ETH Zurich.
I gave an in-depth talk about the inner workings of mlinspect at the Dutch Seminar on Data Systems Design.
Sebastian and I gave a talk on Data Provenance as the Foundation for Data Governance at the virtual Data Centric AI workshop organised by Stanford and ETH Zurich.
We presented our work on mlinspect at the ICAI Lunch Meetup.
mlinspect will be used for NYU’s course on Responsible Data Science to teach students about the importance of data preprocessing, making data distribution debugging part of their methodological toolkit.
I joined UvA as a Ph.D. student to work on data validation for ML pipelines.

Recent Publications

Rana Alotaibi, Yuanyuan Tian, Stefan Grafberger, Jesus Camacho-Rodriguez, Nicolas Bruno, Brian Kroth, Sergiy Matusevych, Ashvin Agrawal, Mahesh Behera, Ashit Gosalia, Cesar Galindo-Legaria, Milind Joshi, Milan Potocnik, Beysim Sezgin, Xiaoyu Li, Carlo Curino (2025). Towards Query Optimizer as a Service (QOaaS) in a Unified LakeHouse Platform: Can One QO Rule Them All? Conference on Innovative Data Systems Research (CIDR).

Sebastian Schelter, Shubha Guha, Stefan Grafberger (2024). Automated Provenance-Based Screening of ML Data Preparation Pipelines. Datenbank-Spektrum.

Stefan Grafberger (2024). Instrumentation and Analysis of Native ML Pipelines via Logical Query Plans. PhD Workshop at VLDB.

Sebastian Schelter, Stefan Grafberger, Maarten de Rijke (2024). Snapcase - Regain Control over Your Predictions with Low-Latency Machine Unlearning. VLDB (demo).

Stefan Grafberger, Paul Groth, Sebastian Schelter (2024). Towards Interactively Improving ML Data Preparation Code via “Shadow Pipelines”. Data Management for End-to-End Machine Learning workshop at ACM SIGMOD.

See all publications

CV

During my studies, I interned with Microsoft GSL, Amazon Research, and Oracle Labs, and worked as a research assistant at TU Munich. I was also an intern and working student at TNG Technology Consulting in Munich and a teaching assistant at the University of Augsburg.

In the past, I worked on deequ, a library for ‘unit-testing’ large datasets with Apache Spark; PGX, an in-memory graph analytics framework; and Umbra, a disk-based database with in-memory performance. Currently, I work on mlinspect and mlwhatif, which aim to diagnose and mitigate robustness and reliability issues in machine learning pipelines.

Contact

I’m reachable via email at grafberger@tu-berlin.de.