In Demystifying Cloud-Native Data Management: Layers of Operation we covered the building blocks for storing state in a cloud-native application. Now let’s look at the various data management options to enable use cases like Backup and Restore, Application Mobility, and Disaster Recovery. Data management is a critical Day 2 service that enterprises need for business continuity during an accidental or malicious loss of data in addition to meeting compliance requirements. Additionally, efficient data mobility is a critical data management necessity to ensure that enterprises have the freedom to choose the best physical storage, data service, or cloud vendor.
Flavors of Data Management
We can categorize the data management approaches to protect data based on the layers we explored earlier. There are, of course, variations, and one can also mix and match the approaches highlighted below. However, let’s enumerate the broad-strokes approaches, which will give us the vocabulary to then tackle the corresponding pros and cons.
1. Storage-Centric Snapshots
In this approach, the snapshot capability provided by the underlying file/block storage implementation is exercised. This approach is transparent to the data service layer such as MySQL. While the snapshot implementation differs across vendors, the high-level flow includes:
- Recording a metadata marker of the disk state to create a snapshot of the volume at that instant. Since there is no awareness of the data services layer, in-memory buffers that haven’t been flushed to disk, open files, pending writes, etc. held by the data service are not considered.
- The bulk data from the snapshot could be stored on the same storage system as the original or in a different storage system such as an object store. For subsequent snapshot operations, most implementations take an incremental approach, copying only the delta from the last snapshot operation.
The result of this snapshot is often referred to as "crash-consistent": the consistency achieved is similar to that of pulling the power plug. However, the results depend on the underlying block/file implementation and how well it handles errors when in-memory state is discarded. Most modern file systems (e.g., ext4, NTFS) use journaling and can recover from this. However, this might not be good enough when you need transaction-level granularity. Another consideration of the snapshot approach is that when you restore, you need to restore the entire storage device rather than individual records or entries. Finally, given that failure domains are another key consideration for a good protection approach, one needs to make sure that this storage-centric solution does not lock you into a single storage technology or vendor.
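To make the incremental step in the flow above concrete, here is a minimal sketch. It assumes a volume can be modeled as a byte string split into fixed-size blocks and detects changes by comparing block fingerprints. Real storage systems typically track changed blocks with their own mechanisms, so the function names and the hashing technique here are purely illustrative.

```python
import hashlib


def block_digests(volume: bytes, block_size: int = 4) -> list[str]:
    """Split a volume image into fixed-size blocks and fingerprint each one."""
    return [
        hashlib.sha256(volume[i:i + block_size]).hexdigest()
        for i in range(0, len(volume), block_size)
    ]


def incremental_delta(previous: list[str], current: list[str]) -> list[int]:
    """Return indexes of blocks that changed or were added since the last snapshot."""
    return [
        idx for idx, digest in enumerate(current)
        if idx >= len(previous) or previous[idx] != digest
    ]


# First snapshot: there is no earlier snapshot, so every block is copied.
v1 = b"AAAABBBBCCCC"
snap1 = block_digests(v1)
first_copy = incremental_delta([], snap1)

# Second snapshot: only the middle block was rewritten, so only it is copied.
v2 = b"AAAAXXXXCCCC"
snap2 = block_digests(v2)
second_copy = incremental_delta(snap1, snap2)
```

Note that nothing in this sketch knows about the data service: whatever bytes are on disk at that instant are what gets captured, which is exactly the crash-consistency trade-off described above.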
2. Storage-Centric with Data Service Hooks
In this approach, as in the earlier method, the data protection process depends on the storage layer’s snapshot functionality. However, the data protection process also makes a call into the data services layer to freeze and unfreeze the specific data service. Most data services support an API that allows these hooks to quiesce and un-quiesce the database in addition to flushing in-memory buffers. A typical top-level flow initiated by the data protection process includes:
- Freeze & Flush the Data Services layer
- Initiate a Storage-layer snapshot
- Unfreeze the Data Services layer
- Record completion and status of the Snapshot process
An example of this approach in the non-cloud-native world is the Volume Shadow Copy Service (VSS) introduced way back in Microsoft Windows Server 2003. The primary advantage of this backup approach is that it offers a better level of consistency while still being quick.
The disadvantage of this approach is that the data service is not available to the higher-level application during the time it is frozen. Hence the time between the freeze and unfreeze needs to be minimized. Additionally, as with the earlier approach, the restoration process is not very granular and one needs to be mindful of failure domains and storage vendor lock-ins. Finally, there is additional error-recovery code that needs to be written and tested to ensure that the data service is never stuck in a frozen mode.
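The freeze, snapshot, unfreeze, and record flow, together with the error-recovery concern, can be sketched as follows. `FakeDataService`, `quiesced`, and `protect` are hypothetical names invented for this illustration, not the API of any particular database or backup tool. The key idea is that the unfreeze call sits in a `finally` clause, so it runs even when the snapshot fails.

```python
import time
from contextlib import contextmanager


class FakeDataService:
    """Hypothetical stand-in for a database that exposes freeze/unfreeze hooks."""

    def __init__(self):
        self.frozen = False

    def freeze_and_flush(self):
        self.frozen = True

    def unfreeze(self):
        self.frozen = False


@contextmanager
def quiesced(service):
    """Freeze the service for the duration of the block, then always unfreeze it."""
    service.freeze_and_flush()
    try:
        yield
    finally:
        # Runs on success and on failure, so the service is never left frozen.
        service.unfreeze()


def protect(service, take_snapshot):
    """Run one protection cycle and return a completion record."""
    record = {"status": "failed", "elapsed_seconds": None}
    started = time.monotonic()
    try:
        with quiesced(service):
            take_snapshot()
        record["status"] = "completed"
    except Exception as exc:
        record["status"] = f"failed: {exc}"
    record["elapsed_seconds"] = time.monotonic() - started
    return record


def failing_snapshot():
    raise RuntimeError("storage backend unavailable")


svc = FakeDataService()
ok = protect(svc, lambda: None)       # snapshot succeeds
bad = protect(svc, failing_snapshot)  # snapshot fails, service is still unfrozen
```

Keeping the work inside the `with` block as small as possible is what keeps the freeze window short.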
3. Data Services Centric
This approach is more top-down: the data protection system is not just aware of the data services layer but is tied to the particular database implementation. There is a rich choice of tools and utilities available both in the open-source domain (e.g., pg_dump for PostgreSQL) and as commercial offerings from various database vendors (e.g., Atlas Backup for MongoDB). The top-level flow in this case includes:
- Based on the data service, pre-select the database protection tool or service
- Optionally operate on a database replica node to minimize impact on data service consumers
- Invoke the database protection tool at select intervals on selected nodes
- Record the protection tool’s job completion status, which will be specific to the tool invoked
Advantages of this approach include potential storage space savings since the process is database-aware and can use database-specific techniques such as compression and gathering only the incrementals since the last backup. Additionally, there is no dependency on the underlying file/block storage layer to support snapshot capabilities. This matters when snapshots may not be available, as can be the case with locally attached storage in Kubernetes environments or ephemeral volumes (such as NVMe instance storage) in cloud environments.
While this approach allows you to abstract away storage implementations, it does not abstract away the database. Also, while there may be higher consistency for that specific data service vs. a snapshot, there is an added access impact on the consumers of the data service while the protection process is executing. There are several techniques for minimizing this impact, for example acting on one of the database replicas so that the primary node continues to be available to the database consumer. However, again, this is a database-dependent approach where the consistency does not extend to other related ancillary services, so it is not a panacea for data protection. Finally, recovery with this approach can be complicated and can take significantly longer than the other approaches highlighted above.
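As a small illustration of the flow for this approach, the sketch below builds a pg_dump command line that targets a replica when one is available and falls back to the primary otherwise. The function name, host names, database name, and output path are all hypothetical. The sketch only constructs the argument list and does not run it, and whether dumping from a replica is appropriate depends on how that replica is configured.

```python
def build_pg_dump_command(dbname, output_path, replicas, primary):
    """Build a pg_dump invocation, preferring a replica over the primary."""
    # Prefer a replica so that consumers of the primary are not affected.
    host = replicas[0] if replicas else primary
    return [
        "pg_dump",
        f"--host={host}",
        "--format=custom",       # pg_dump's compressed archive format
        f"--file={output_path}",
        dbname,
    ]


# Hypothetical hosts and paths, for illustration only.
cmd = build_pg_dump_command(
    "orders",
    "/backups/orders.dump",
    replicas=["replica-1.db.internal"],
    primary="primary.db.internal",
)
fallback = build_pg_dump_command(
    "orders", "/backups/orders.dump", replicas=[], primary="primary.db.internal"
)
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`, with the exit status recorded as the job's completion status. A different data service would need a different tool and a different command, which is the database dependence noted above.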
4. Application-Centric
This approach is where things get interesting. Enterprises typically care about the business continuity of an application, not of its individual pieces. Also, a backup is only as good as the operational ease of restoring that application when required. Earlier, we looked at the makeup of a "stateful application" and saw that, under the hood, a typical cloud-native application is composed of microservices with multiple data services, storage systems, and related application components, including Kubernetes objects.
This microservices explosion, coupled with a Kubernetes-based orchestration system that does not bind applications to specific nodes, makes data protection for cloud-native applications very different from traditional server or hypervisor solutions. Consequently, enterprises that need low recovery and restore times along with operational ease for these growing cloud-native applications need to take a fresh look at selecting a data management approach. The high-level flow for an application-centric data management approach includes:
- Discovering the application components using Kubernetes constructs like namespaces, labels, etc.
- Setting a policy on the application in terms of backup frequency, retention rates, etc.
- Applying the policy to all the components in an application-consistent manner which includes a coordinated operation of quiescing the application.
- Using the methods and tools provided by both the underlying storage and data services layers to back up application data, definitions, and other stateful constructs.
- Restoring all application components, taking into account any ordering and sequencing dependencies.
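The discovery and restore-ordering steps above can be sketched as follows. The inventory is a hypothetical in-memory list standing in for objects that a real system would fetch from the Kubernetes API, and the dependency table is an assumption made for illustration, not a rule defined by Kubernetes. The ordering uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical inventory standing in for objects returned by the Kubernetes API.
CLUSTER_OBJECTS = [
    {"kind": "Deployment", "name": "web", "namespace": "shop", "labels": {"app": "shop"}},
    {"kind": "StatefulSet", "name": "mysql", "namespace": "shop", "labels": {"app": "shop"}},
    {"kind": "PersistentVolumeClaim", "name": "mysql-data", "namespace": "shop", "labels": {"app": "shop"}},
    {"kind": "Secret", "name": "mysql-creds", "namespace": "shop", "labels": {"app": "shop"}},
    {"kind": "Deployment", "name": "blog", "namespace": "cms", "labels": {"app": "blog"}},
]

# Assumed restore dependencies: each kind maps to the kinds that must exist first.
RESTORE_DEPENDENCIES = {
    "Secret": set(),
    "PersistentVolumeClaim": set(),
    "StatefulSet": {"Secret", "PersistentVolumeClaim"},
    "Deployment": {"Secret", "StatefulSet"},
}


def discover(objects, namespace, selector):
    """Select the objects in a namespace whose labels match every selector entry."""
    return [
        obj for obj in objects
        if obj["namespace"] == namespace
        and all(obj["labels"].get(key) == value for key, value in selector.items())
    ]


def restore_order(components):
    """Sort components so that each kind is restored after the kinds it depends on."""
    kinds = {c["kind"] for c in components}
    # Only keep dependencies on kinds that are actually part of this application.
    graph = {kind: RESTORE_DEPENDENCIES.get(kind, set()) & kinds for kind in kinds}
    rank = {kind: i for i, kind in enumerate(TopologicalSorter(graph).static_order())}
    return sorted(components, key=lambda c: (rank[c["kind"]], c["name"]))


app = discover(CLUSTER_OBJECTS, "shop", {"app": "shop"})
ordered_kinds = [c["kind"] for c in restore_order(app)]
```

With this table, the Secret and PersistentVolumeClaim are restored before the StatefulSet that uses them, and the Deployment comes last. The relative order of kinds that do not depend on each other is not fixed.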
The advantages of this approach include a higher level of consistency and the flexibility to work across multiple implementations of storage and data service vendors, giving enterprises freedom of choice. An additional advantage is the separation of concerns, whereby the IT/Operations team can independently set compliance policies on the application without requiring the developers to make code changes to enable data protection. Finally, this approach provides application mobility that can be extended across clusters, regions, and clouds.
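To illustrate the separation of concerns, here is a minimal sketch of a policy expressed as data that lives outside the application code. `BackupPolicy` and its fields are hypothetical names chosen for this example. It assumes a policy is nothing more than a backup frequency plus a count of backups to retain, which is simpler than what real policies usually cover.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BackupPolicy:
    """Hypothetical policy: how often to back up and how many backups to keep."""
    frequency: timedelta
    retain_last: int

    def is_due(self, last_backup, now):
        """A backup is due if none exists or the last one is older than the frequency."""
        return last_backup is None or now - last_backup >= self.frequency

    def expired(self, backup_times):
        """Return the backups that fall outside the retention count, newest first."""
        newest_first = sorted(backup_times, reverse=True)
        return newest_first[self.retain_last:]


policy = BackupPolicy(frequency=timedelta(hours=6), retain_last=2)
t0 = datetime(2024, 1, 1, 0, 0)
history = [t0, t0 + timedelta(hours=6), t0 + timedelta(hours=12)]

due = policy.is_due(history[-1], t0 + timedelta(hours=18))
not_due = policy.is_due(history[-1], t0 + timedelta(hours=13))
to_delete = policy.expired(history)
```

Because the policy is plain data, an operations team could change the frequency or retention without touching the application itself.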
The disadvantage of this approach is that this is a relatively new construct that has come into existence with the rise of cloud-native applications. However, with stateful Kubernetes applications becoming ubiquitous, this concern is fast disappearing.
Summary / TL;DR
We built on the vocabulary defined in Demystifying Cloud-Native Data Management: Layers of Operation that deconstructed the definition of state in a cloud-native application to explore flavors of data management functions like backup and recovery and portability. Flavors of data management introduced here to protect application state included:
- Storage-centric snapshots provided by the underlying file or block storage,
- Storage-centric with data service hooks that spans across storage and data services layers,
- Data service-centric approaches that use database-specific utilities, and finally
- Application-centric that exercises all the above capabilities in a coordinated manner.
Each approach has its own pros and cons in terms of speed, consistency, and cost, so the optimal approach will depend not only on the capabilities available but also on the specific application needs in terms of backup and recovery objectives and compliance. However, regardless of the approach, the unit of atomicity for a good data management solution needs to be application-centric (and not storage or data service specific). This is exactly how Kasten, the leader in cloud-native data management, approaches this challenge.