Data comparison post migration

September 13, 2024

How do we ensure data integrity between systems

How to Ensure Data Integrity Between Systems

Migrating data engineering systems across different technologies requires rigorous steps to ensure data integrity. The goal is to verify that data remains accurate and consistent, and that downstream systems consuming this data are unaffected by the migration.

The Comparison Process

Imagine you’ve migrated your entire data warehouse from Teradata Vantage to Databricks. Teradata remains your current production environment, while Databricks acts as the Production Parallel (PP). How can your team and business users gain confidence that everything is functioning as expected in the new system?

There are a few strategies you can implement:

1. Data Consolidation for Comparison

Bring all data into a single platform, perform a column-wise hash, and run validation SQL queries to compare between the two systems. Examples of queries include:

  • Data in Prod, but not in PP
  • Data in PP, but not in Prod
  • Matching records

Break down the comparison based on the granularity of business objects (e.g., customer data). Some key checks could include:

  • How many customers are missing in PP vs. Prod?
  • Are the customers sourced from the same or different sources?

2. Handling Large Data Volumes

If the data size is significant (in terabytes), this process could be time-consuming. In such cases, sampling can help. Focus on larger customers or records that could have a material impact if the data is incorrect. Business users can assist in identifying such critical data points. Once identified, follow the earlier comparison method to validate those samples. Ensure the samples cover all input sources in your data pipeline, especially if customer data is being sourced from multiple origins.

Best Practices and Technologies

There are various ways to approach this task. You can even develop custom tools to perform these comparisons. For example, a custom comparison tool might:

  • Read the table or dataset as input with schema on read.
  • Retrieve dataframes from both PP and Prod.
  • For each column, identify if values for a given surrogate key differ between the two systems.
  • Summarize how many values differ, highlighting columns with the largest discrepancies.

By targeting the column with the biggest difference, you can then sample the data and trace back any discrepancies to their source.

Happy learning!