Deequ - Data Quality Validation for Machine Learning Pipelines

Abstract

Modern machine learning (ML) systems consist of complex ML pipelines that typically make many implicit assumptions about the data they consume (e.g., about the scales of variables, the presence of missing values, or the dictionary of categorical values). Violations of these assumptions can result in crashes or wrong predictions. We therefore present Deequ, a library that allows users to explicitly specify their assumptions about the data in a declarative way. Deequ enables the efficient automatic validation of these assumptions on large datasets. It is an open source library based on Apache Spark and meets the requirements of production use cases at Amazon.
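
To make the idea of declaring data assumptions concrete, the following is a minimal sketch in plain Python. It is not Deequ's API and does not use Apache Spark: the `Check` class, its methods (`is_complete`, `has_range`, `is_contained_in`), the `verify` function, and the sample rows are all hypothetical names chosen for illustration. It only shows how the three kinds of assumptions mentioned above (missing values, variable scales, categorical dictionaries) could be stated once and then validated automatically.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List, Tuple

Row = Dict[str, Any]
Constraint = Tuple[str, Callable[[List[Row]], bool]]


@dataclass
class Check:
    """Hypothetical declarative check: a named list of constraints (not Deequ's API)."""
    name: str
    constraints: List[Constraint] = field(default_factory=list)

    def is_complete(self, column: str) -> "Check":
        # Assumption: the column has no missing (None) values.
        self.constraints.append((
            f"is_complete({column})",
            lambda rows: all(r.get(column) is not None for r in rows),
        ))
        return self

    def has_range(self, column: str, low: float, high: float) -> "Check":
        # Assumption: every present value lies in [low, high].
        self.constraints.append((
            f"has_range({column})",
            lambda rows: all(
                low <= r[column] <= high
                for r in rows
                if r.get(column) is not None
            ),
        ))
        return self

    def is_contained_in(self, column: str, allowed: Iterable[Any]) -> "Check":
        # Assumption: every present value comes from a fixed dictionary.
        allowed_set = set(allowed)
        self.constraints.append((
            f"is_contained_in({column})",
            lambda rows: all(
                r[column] in allowed_set
                for r in rows
                if r.get(column) is not None
            ),
        ))
        return self


def verify(rows: List[Row], check: Check) -> Dict[str, bool]:
    """Evaluate every declared constraint and report pass/fail per constraint."""
    return {label: predicate(rows) for label, predicate in check.constraints}


# Illustrative data, invented for this sketch.
rows = [
    {"age": 34, "country": "DE"},
    {"age": None, "country": "US"},
    {"age": 51, "country": "FR"},
]

check = (
    Check("training data assumptions")
    .is_complete("age")
    .has_range("age", 0, 120)
    .is_contained_in("country", ["DE", "US"])
)

results = verify(rows, check)
for label, passed in results.items():
    print(label, "passed" if passed else "FAILED")
```

In this sketch the assumptions are declared separately from the code that evaluates them, so the same check can be rerun on each new batch of data before it reaches the pipeline. A library operating on large datasets would compute the underlying statistics in a distributed fashion rather than iterating over rows in memory as done here.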

Publication
Machine Learning Systems workshop at the conference on Neural Information Processing Systems (NeurIPS)
Stefan Grafberger
Ph.D. Student

Data management for machine learning