Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines

Machine Learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policymakers, scientists, and the media. ML applications are often very brittle with respect to …

Differential Data Quality Verification on Partitioned Data

Modern companies and institutions rely on data to guide every single decision. Missing or incorrect information seriously compromises any decision process. In previous work, we presented Deequ, a Spark-based library for automating the verification of …

Deequ - Data Quality Validation for Machine Learning Pipelines

Modern machine learning (ML) systems are composed of complex ML pipelines, which typically make many implicit assumptions about the data they consume (e.g., about the scales of variables, the presence of missing values or the dictionary of …