Data Quality is Not Auto-magic in a Data Lake or Lakehouse
By Betsy Burton
I have been researching the evolving state of data storage over the past few months as I work on the Data Lake and Lakehouse Globes. It is interesting to see organizations rushing to capture every possible data point, from flat files and relational databases to complex and streaming data, embracing the Data Lake and its evolution, the Data Lakehouse.
These architectures offer great flexibility, allowing data scientists and analysts to ingest petabytes of diverse data types quickly.
However, as an old-school DBMS person, I am left wondering: 1) as the volume and variety of data swell, and 2) as it becomes increasingly easy to integrate data, what are the chances that data quality and integrity will get lost?
My answer is yes: data integrity and quality will continue to be a major, overlooked challenge for most organizations.
Simply making data accessible and easily manipulated is not the same as making it valuable. A lake full of bad data is not a resource; it’s an expensive swamp.
Data Quality Challenges of Data Ingestion
The core strength of the Data Lake model—its schema-on-read flexibility—is also its greatest weakness regarding quality.
Tools and pipelines now make it alarmingly easy to dump virtually any data type into the lake, often with minimal configuration. This includes simple flat files (CSV, Parquet), relational data (SQL), hierarchical/document data (JSON, XML), and complex unstructured data.
This ease of ingestion is precisely the risk, as it often bypasses the rigorous quality checks traditionally enforced by a Data Warehouse. Without a strong governance framework, your lake can quickly become a “Data Swamp,” where data is inaccurate, inconsistent, incomplete, and, most critically, lacking integrity.
Data Quality: Incorrect Values and Lost Trust
We classically define Data Quality as a measure of data's fitness for use. It deals with the correctness, completeness, and value of the data points themselves.
In the Lakehouse, these issues often arise because it is so easy to ingest data into a Data Lake without its being standardized or validated against business context:
- Inaccuracy: A customer’s purchase amount of $42.00 is accidentally stored as $420.00. The record is structurally sound, but the value is wrong, leading to flawed financial reporting.
- Incompleteness: A crucial field, such as sale_region, is optional in the source system but mandatory for a downstream analysis model. If thousands of records are missing this key attribute, the resulting AI model cannot accurately predict regional demand.
- Inconsistency: Basic issues like customer names stored as Last, First in one source but as First Last in another. These semantic inconsistencies prevent unified analysis, making it impossible to confidently run a query across all systems to get a single view of the customer.
- Timeliness: Sensor data used for real-time operational decisions is delayed by two hours. While the data is accurate and complete, its staleness renders it useless for its intended purpose.
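The four dimensions above can be expressed as simple ingestion-time checks. The sketch below is illustrative only: the field names (sale_region, amount, customer_name, event_time), the amount range, and the two-hour freshness SLA are hypothetical examples drawn from this article, not a prescribed schema.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness SLA, matching the two-hour staleness example above.
MAX_STALENESS = timedelta(hours=2)

def check_record(record, now=None):
    """Return a list of quality violations for one ingested record."""
    now = now or datetime.now(timezone.utc)
    violations = []

    # Incompleteness: a field mandatory downstream must be present and non-empty.
    if not record.get("sale_region"):
        violations.append("incomplete: missing sale_region")

    # Inaccuracy: a plausibility range catches order-of-magnitude typos
    # like $42.00 entered as $420.00 (bounds are business-specific).
    amount = record.get("amount", 0.0)
    if not (0.0 < amount <= 100.0):
        violations.append(f"inaccurate: amount {amount} outside expected range")

    # Inconsistency: normalize "Last, First" to "First Last" so both
    # source conventions yield one canonical form.
    name = record.get("customer_name", "")
    if "," in name:
        last, first = [p.strip() for p in name.split(",", 1)]
        record["customer_name"] = f"{first} {last}"

    # Timeliness: flag events older than the freshness SLA.
    event_time = record.get("event_time")
    if event_time and now - event_time > MAX_STALENESS:
        violations.append("stale: event older than freshness SLA")

    return violations
```

In practice these checks would live in the ingestion pipeline itself, so a record is flagged (or quarantined) before it ever lands in the lake.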
Based on my own experience and conversations with many clients, failing to address these quality dimensions means that data scientists and analysts spend 80% of their time cleaning data rather than generating insights, directly undercutting the entire value proposition of the Data Lakehouse.
Data Integrity: The Linchpin of the Lakehouse
While accuracy and consistency are crucial, data integrity is the invisible force holding the entire Data Lakehouse together. Integrity refers to the correctness of the relationships between your data assets.
Consider a retail Data Lakehouse storing customer, order, and product data. The risk is a classic join problem: if the customer_id column in the orders table doesn’t correctly match the id in the customer table, every analytical query, from calculating customer lifetime value (CLV) to measuring retention, will be flawed.
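The join problem can be made concrete with a minimal orphan-key check: before exposing orders for analysis, verify that every customer_id resolves to a customer. The table and column names follow the retail example above; this is a sketch, not a full referential-integrity framework.

```python
def find_orphan_orders(orders, customers):
    """Return orders whose customer_id has no matching customer record.

    orders: iterable of dicts, each with a "customer_id" key
    customers: iterable of dicts, each with an "id" key
    """
    # Build the set of known customer keys once, then anti-join against it.
    known_ids = {c["id"] for c in customers}
    return [o for o in orders if o["customer_id"] not in known_ids]
```

Any non-empty result means downstream CLV or retention queries would silently drop or miscount those orders, which is exactly the failure mode described above.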
Furthermore, as source systems undergo Schema Evolution (e.g., adding a new field), your lake ingestion pipeline must detect and manage this change without breaking the downstream joins that connect the new data with the old. This is a primary function of the Data Lakehouse’s transactional layer. If the integrity of your joins is compromised, the analysis and resulting AI/ML models built on this data will be fundamentally unreliable, leading to poor business decisions.
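One hedged sketch of how a pipeline might manage schema evolution: compare each incoming batch's columns against the expected schema, treat purely additive changes as safe evolution, and fail batches that are missing expected columns (such as a dropped join key). The functions and the accept/reject policy here are illustrative, not any specific product's API.

```python
def diff_schema(expected_cols, batch_cols):
    """Return (added, missing) columns for an incoming batch.

    Added columns are usually a safe, additive evolution; missing columns
    (e.g. a dropped join key) would break downstream joins.
    """
    expected, observed = set(expected_cols), set(batch_cols)
    return sorted(observed - expected), sorted(expected - observed)

def can_ingest(expected_cols, batch_cols):
    """Accept additive schema changes; reject batches missing expected columns."""
    _, missing = diff_schema(expected_cols, batch_cols)
    return not missing
```

A transactional table format would then record the evolved schema atomically, so old and new data remain joinable.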
The Essential Human Element: Governance and Ownership
While automated tools and transactional formats are essential, they are not magic. Data quality and integrity are fundamentally a human responsibility requiring continuous effort and organizational commitment.
This effort begins with Data Governance, the formal process of assigning authority and accountability for managing, using, and protecting data. Specifically, organizations must:
- Define and Document Quality Standards: Engineers and data owners must collaborate to establish clear, measurable metrics for acceptable data.
- Assign Data Ownership (Stewardship): A Data Steward, a subject matter expert from the business unit that creates the data, must be formally assigned to each critical data asset.
- Establish Remediation Processes: When the “Quality-as-Code” checks fail, it’s a person, not the code, who determines the corrective action.
- Continuous Monitoring and Auditing: Automated tools generate alerts, but humans must interpret the trends.
This human layer of stewardship and governance turns reactive data cleaning into a proactive, strategic advantage.
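One way to tie “Quality-as-Code” checks back to accountable people is to express the documented standards as declarative rules, each paired with the steward who owns remediation when it fails. Everything below (rule names, checks, and addresses) is hypothetical, a sketch of the pattern rather than a real rule catalog.

```python
# Hypothetical quality rules: each pairs an automated check with the
# human steward responsible for remediation when it fails.
RULES = [
    {"name": "sale_region_present",
     "check": lambda r: bool(r.get("sale_region")),
     "steward": "sales-ops@example.com"},
    {"name": "amount_positive",
     "check": lambda r: r.get("amount", 0) > 0,
     "steward": "finance-data@example.com"},
]

def run_rules(record, rules=RULES):
    """Return (rule_name, steward) for every failed rule, for alert routing."""
    return [(r["name"], r["steward"]) for r in rules if not r["check"](record)]
```

The point of the pattern is the pairing itself: the code detects the failure, but the alert lands with a named person who decides the corrective action.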
The Bottom Line
The temptation to leverage easy-to-use ingestion tools for maximum data velocity is strong, and these tools do make data access and ingestion easier. However, the effort invested in defining data quality rules and ensuring data integrity is not an optional governance layer; it is the foundational requirement for deriving value from your Data Lake and Data Lakehouse architectures.
Failure to rigorously enforce quality, integrity, and human stewardship turns a state-of-the-art data platform into a massive liability, proving that even with the most powerful tools, the age-old principle of “Garbage In, Garbage Out” still applies.

