Software systems that learn from data are being deployed in increasing numbers in real-world applications. Ensuring at development time that the end-to-end ML pipelines of such applications adhere to sound experimentation practices, such as the strict isolation of train and test data, is a difficult and tedious task. Furthermore, there is a need to enforce legal and ethical compliance in automated decision-making with ML: for example, to determine whether a model works equally well for different groups of users. To enforce privacy rights (such as the `right to be forgotten'), we must identify which models were trained on a given user's data, so that we can retrain them without this data. Moreover, model predictions can be corrupted by undetected data distribution shift, e.g., when the train/test data was incorrectly sampled or changed over time (covariate shift), or when the distribution of the target label changed (label shift). Data scientists also require support for pipeline debugging and for uncovering erroneous data, e.g., to identify samples that are not helpful for the classifier and are potentially dirty or mislabeled, or to identify subsets of the data for which a model does not work well. …