SRE: Data Integrity

Data integrity usually refers to the accuracy and consistency of data throughout its lifetime. For customer involved online services, things can go even more complex. Any data corruption, data loss, or extended unavailability are considered data integrity issue for the customer.

Data integrity in many cases can be a big problem. For instance, the database table was corrupted and we had to spend a few hours restore the data from the snapshot database. In another instance, the data was accidentally deleted and had a fatal impact on our client, as the client never expected the data to be unavailable. However, it was too expensive to restore the data, so we had to fix the dependent data record and some code on the client side to mitigate the impact on the clients. There is another instance that the data loaded for the client is not what they expected. This is clearly a data consistency issue. However, the issue was not reproducible and thus made it super hard for the team to debug.

There are many types of failure that could lead to the data integrity issue:

Root cause

User actions, Operator Error, Application Bug, Infrastructure defect, Hardware Failure, Site Disaster

Scope

Wide, Narrow, directed

Rate

Big Bang, Slow and Steady

This leads to the 24 combinations of the data integrity issue. How do we handle such issues?

First layer action is to adopt soft delete to the client data. The idea behind soft delete is to make sure that the data is recoverable if needed, for example, from operation errors. A soft delete is usually implemented through adding a is_delete flag and a deleted_at time stamp to the table. When data is to be deleted, they are not deconstructed from the database immediately, but will be marked as deleted with a scheduled deleted time in the future, say 60 days from the deletion. In this way, the data deletion could be reverted if necessary.

There are different opinions about the soft deletion solution, as it might introduce extra complexity on the data management. For example, when there are hierarchies and dependency relationship between the data records, the deletion might break the data constraints. In the meantime, it makes the data selection and option more complex, as a customized filter has to be applied to the data in order to filter out the data that has been soft deleted. And recovering the soft delete data can also be complex especially only part of the data is deleted, a recovery might involve complex data merge.

The second layer action is to build the data backup system and make the recovery process fast. We need to be more careful here that the data backup or archive is not the purpose of data integrity. Find out ways to prevent the data loss, to detect data corruption, to quickly recover from data integrity instance is more important. Data backup is often times neglected as it yields no visible benefit and not a high priority for anyone. But building a restoring system is a much more useful goal.

For many cloud services, data backup is an option, for example, AWS RDS supports creating data snapshot, while the cloud cache Redis cluster supports backup the data on the EBS storage. Many people stop here as they assume that the data is currently back up. However, we should realize that the data recovery could take a long time to finish, and the data integrity is broken during the recovery time. The recovery time should be an important metric for the system.

Besides back up, many systems use replicas. And by failover to the replica when the primary node had an issue, they could improve the system availability. We need to realize that the data might not be consistent between the primary instance and the replica instance.

A third layer is to detect the error earlier. For example, have a data validation job that validates the integrity of the data between different storage systems so that the issue could be fixed quickly when it happens.

SRE: Data Integrity

Leave a Reply Cancel reply

Categories

Archives