Identifying Data Distribution Bugs in ML Pipelines with mlinspect

Abstract

Machine Learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policymakers, scientists, and the media. ML applications are often brittle with respect to their input data, which leads to concerns about their reliability, accountability, and fairness. While fairness cannot be fully automated, we can assist data scientists in identifying certain types of problems. We recently proposed mlinspect, a library that enables lightweight, lineage-based inspection of ML preprocessing pipelines in order to detect hard-to-identify data issues. In this demonstration, we employ mlinspect to inspect a representative healthcare ML pipeline and showcase how to detect data distribution bugs. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries, such as estimator/transformer pipelines, can handle both relational and matrix data, and does not require manual code instrumentation. During the demo, we will provide participants with a pipeline that contains data distribution bugs. They will use mlinspect to visually inspect the pipeline via an automatically extracted dataflow representation and to examine samples of the intermediate outputs of the contained operators. We will then point them to the specific operators in their pipeline that introduce data distribution bugs. Attendees will be tasked with rewriting the code live to fix the problem, and mlinspect will directly reflect the changes made to the pipeline code.
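To give a sense of the workflow described above, the sketch below runs an inspection over an existing pipeline script using mlinspect's fluent API. It is a minimal sketch based on the project's public documentation: the file name "healthcare_pipeline.py" is a placeholder, the sensitive column "race" is an illustrative choice, and the exact class and method names (PipelineInspector, NoBiasIntroducedFor, MaterializeFirstOutputRows) are assumptions that may differ between library versions.

    from mlinspect import PipelineInspector
    from mlinspect.checks import NoBiasIntroducedFor
    from mlinspect.inspections import MaterializeFirstOutputRows

    # Run the unmodified pipeline script and collect inspection results.
    # "healthcare_pipeline.py" is a placeholder for the demo pipeline.
    inspector_result = PipelineInspector \
        .on_pipeline_from_py_file("healthcare_pipeline.py") \
        .add_check(NoBiasIntroducedFor(["race"])) \
        .add_required_inspection(MaterializeFirstOutputRows(5)) \
        .execute()

    # The extracted dataflow representation is a DAG of pipeline operators;
    # its per-operator results point to where group proportions change.
    extracted_dag = inspector_result.dag
    check_results = inspector_result.check_to_check_results
    inspection_results = inspector_result.dag_node_to_inspection_results

The NoBiasIntroducedFor check flags operators (for example, filters or joins) whose output changes the relative proportions of the named sensitive groups, while MaterializeFirstOutputRows captures sample intermediate outputs for visual inspection, matching the two inspection modes the abstract describes.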

Publication
ACM SIGMOD (demo)
Stefan Grafberger