Why VM-based Data Management Doesn’t Work in a Cloud-Native World
We are often asked why traditional data management platforms that have worked extremely well for virtualized environments have failed to keep pace with the requirements of applications deployed on container orchestration platforms such as Kubernetes. In particular, as more stateful applications get deployed on cloud-native platforms, the requirement for data management functions such as backup and recovery, disaster recovery, and application mobility is becoming extremely pressing.
As a part of our research when building K10, a cloud-native data management platform to meet the needs mentioned above, we have discovered that there is a fundamental impedance mismatch between solutions built for VMs vs. those that have been purpose-built for cloud-native platforms and applications. In fact, given the baked-in architecture of a number of these legacy and often hardware-based solutions, it would be very hard to retrofit them without exposing wide gaps and plenty of sharp edges. After all, they would need to be rearchitected not only to reflect very significant technical shifts but also to mirror the DevOps philosophy that is changing engineering organizations for the better.
The following sections dive into the details of this mismatch and will walk you through some of the complexity under the hood. More importantly, they will also focus on the shift in roles and responsibilities in development teams today and walk through the people and technology-related impact of this organizational shift.
New Application and Infrastructure Design Patterns
Two fundamental shifts have taken place in application and infrastructure design patterns in cloud-native environments. First, new applications are being designed as microservice-based systems that have a large number of components compared to traditional monolith systems. Second, apart from being deployed as containers, these applications are managed via container-orchestration platforms that have very different scheduling, visibility, and resiliency behaviors when compared to server virtualization systems. Some of the differences that show up because of these shifts include:
Application Deployment and Distribution: One of the big deployment changes for containerized applications is that there is no fixed mapping of applications to nodes (a VM or a server). For fault-tolerance and performance, a container orchestration framework will use its own placement policy to distribute application components (e.g., containers, software load balancers, etc.) across a number of nodes in the cluster. Further, different applications will often be co-located on the same node. Traditional data management systems that operate at the server or VM level will fail as they can never independently capture the state of just a given application without pulling in unrelated applications. This overly broad approach will suffer from significant resource overheads and consistency loss.
State and Services Explosion: With the growth of microservices and serverless frameworks, what used to be a single application has now been broken up into hundreds of discrete components that can have independent lifecycles. VM-focused solutions were not built to handle the scale of millions of components found in a large production cluster. State that might have earlier shown up on disk (e.g., configuration) is now broken up into many configuration objects that are only visible at the container orchestration level.
Given their infrastructure focus and lack of visibility into the application, VM-based solutions might be able to capture data that resides on disk volumes but are simply unable to gather any of the other configuration, secrets, or application state that makes up the majority of the application. Even when data on disks can be gathered, there is no easy way to tie it back to applications for later use. For example, in a recovery scenario, the right volume grouping would need to be manually recreated from a sea of volume backups, hand-crafted mappings in Kubernetes created for these restored volumes, and then associated with any leftover or manually recreated application configuration.
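To make the contrast concrete, here is a minimal sketch of what capturing state per application, rather than per disk, might look like. The resource records, the `app` label key, and the `group_by_application` helper are all assumptions made for illustration; a real system would read this information from the Kubernetes API rather than from an in-memory list.

```python
from collections import defaultdict

# Hypothetical, simplified resource records. In a real cluster these would be
# read from the Kubernetes API; the "app" label key is an assumption.
RESOURCES = [
    {"kind": "PersistentVolumeClaim", "name": "orders-db-data", "labels": {"app": "orders"}},
    {"kind": "Secret", "name": "orders-db-creds", "labels": {"app": "orders"}},
    {"kind": "ConfigMap", "name": "orders-config", "labels": {"app": "orders"}},
    {"kind": "PersistentVolumeClaim", "name": "search-index", "labels": {"app": "search"}},
    {"kind": "ConfigMap", "name": "search-config", "labels": {"app": "search"}},
]


def group_by_application(resources, label_key="app"):
    """Group resources by application label so volumes stay tied to their config."""
    groups = defaultdict(list)
    for resource in resources:
        app = resource["labels"].get(label_key, "<unlabeled>")
        groups[app].append((resource["kind"], resource["name"]))
    return dict(groups)


if __name__ == "__main__":
    for app, members in group_by_application(RESOURCES).items():
        print(app, members)
```

A volume-only backup would see just the two PersistentVolumeClaims above, with nothing to indicate which application each belongs to or which secrets and configuration go with it.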
Secure Namespacing: One of the big benefits of container orchestration frameworks like Kubernetes is the security primitives available out of the box. This includes the ability to use Role-Based Access Control (RBAC) to restrict application access or even network policies that, for improved security, deny access to applications and their associated data services from outside the cluster or even the namespace.
However, traditional data protection products usually run outside of your Kubernetes clusters and this lack of visibility prevents them from discovering or accessing application data. This has led to the emergence of hacks where VM-based solutions try to make every pod or container look like a VM by giving them external IP addresses, which ends up significantly weakening the security of the applications they are meant to protect. Alternatively, they try to remount all volumes within a special container but this multi-reader-writer sharing is simply not possible with most block storage systems.
Dynamic Autoscaling and Rescheduling: Another constant for cloud-native applications is the dynamic nature of the application. Applications can auto-scale in response to load; containers can be dynamically rescheduled onto different nodes for better load balancing; deployments, which can happen on an hourly basis, can involve rolling upgrades; and new application components can be added or removed at any time. In short, there will never be a static definition of a cloud-native application. With this highly dynamic shuffling, there is no IP address stability either, as addresses will likely change with each container restart.
Given that virtualized environments never see such a significant rate of change, VM-based solutions rely on relatively inflexible policy definitions, static VM groupings, and stable IP addresses. Even systems that leverage VM-level tagging aren’t effective as there is no mapping from VMs to applications anymore. This leads to constant operator pain and process failure when applied to cloud-native infrastructure and applications.
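As a rough illustration of why static definitions break down, the sketch below compares a policy keyed on IP addresses with one keyed on labels, before and after a reschedule. The pod names, IP addresses, labels, and helper functions are all made up for this example and are not taken from any particular product.

```python
def select_by_labels(pods, selector):
    """Return pods whose labels contain every key/value pair in the selector."""
    return [
        pod for pod in pods
        if all(pod["labels"].get(key) == value for key, value in selector.items())
    ]


def select_by_ip(pods, ips):
    """Return pods whose current IP is in a previously recorded set."""
    return [pod for pod in pods if pod["ip"] in ips]


# Hypothetical snapshots of the same application before and after it was
# scaled out and rescheduled. Names and addresses are illustrative only.
BEFORE = [
    {"name": "web-1", "ip": "10.0.0.4", "labels": {"app": "shop", "tier": "web"}},
    {"name": "db-0", "ip": "10.0.0.7", "labels": {"app": "shop", "tier": "db"}},
]
AFTER = [
    {"name": "web-2", "ip": "10.0.3.9", "labels": {"app": "shop", "tier": "web"}},
    {"name": "web-3", "ip": "10.0.5.2", "labels": {"app": "shop", "tier": "web"}},
    {"name": "db-0", "ip": "10.0.2.1", "labels": {"app": "shop", "tier": "db"}},
]

# A static policy recorded when the BEFORE snapshot was taken.
STATIC_IPS = {pod["ip"] for pod in BEFORE}

if __name__ == "__main__":
    print("by IP:", [p["name"] for p in select_by_ip(AFTER, STATIC_IPS)])
    print("by label:", [p["name"] for p in select_by_labels(AFTER, {"app": "shop"})])
```

In this toy scenario the IP-based policy matches nothing after the reschedule, while the label-based selection still finds every component of the application, including the newly added one.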
Polyglot Persistence: Another very interesting trend that has coincided with the growth of cloud-native applications is that of polyglot persistence where multiple data services (e.g., MongoDB and Cassandra) are used together in the same application and often with replication enabled for high availability. With the shift from using a single data service to multiple independent ones, we now need intelligence to extract a single copy of data from replicated systems, the ability to do this at runtime from a secondary system so as not to impact primary performance, and the ability to do this with consistency (both within and across services) for systems that might be eventually consistent. Unfortunately, without application knowledge, solutions that are limited to an infrastructure-centric view will simply not be effective.
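One piece of the intelligence described above, picking a single copy of the data from a replicated service without loading its primary, can be sketched as follows. The service names, replica records, and the `lag_seconds` field are hypothetical, and the sketch deliberately ignores the harder problem of consistency across services.

```python
def pick_backup_source(replicas):
    """Prefer the healthy secondary with the lowest lag; fall back to a healthy primary."""
    healthy = [r for r in replicas if r["healthy"]]
    secondaries = [r for r in healthy if r["role"] == "secondary"]
    if secondaries:
        return min(secondaries, key=lambda r: r["lag_seconds"])
    primaries = [r for r in healthy if r["role"] == "primary"]
    return primaries[0] if primaries else None


def plan_backup(services):
    """Map each data service to the name of the replica to read from (or None)."""
    plan = {}
    for service, replicas in services.items():
        source = pick_backup_source(replicas)
        plan[service] = source["name"] if source else None
    return plan


# Hypothetical application that uses two independent, replicated data services.
SERVICES = {
    "document-store": [
        {"name": "docs-0", "role": "primary", "healthy": True, "lag_seconds": 0},
        {"name": "docs-1", "role": "secondary", "healthy": True, "lag_seconds": 12},
        {"name": "docs-2", "role": "secondary", "healthy": True, "lag_seconds": 3},
    ],
    "relational-store": [
        {"name": "sql-0", "role": "primary", "healthy": True, "lag_seconds": 0},
        {"name": "sql-1", "role": "secondary", "healthy": False, "lag_seconds": 1},
    ],
}

if __name__ == "__main__":
    print(plan_backup(SERVICES))
```

Note that none of the inputs to this decision (replica roles, health, replication lag) are visible from the disk or VM level; they are only available to something that understands the application and its data services.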
Elastic Infrastructure: Finally, we have discovered that most traditional backup systems were created to run on servers or VMs. Scale-out, achieved at the granularity of adding more “boxes” to a deployed system, is cost-inefficient as addressing a single bottlenecked component requires adding other relatively unused resources. These server-centric architectures fail to take advantage of a cloud-native platform where an effective data management system can dynamically scale up (or down!) the components that are in heavy (or light) use.
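The cost argument can be illustrated with some simple arithmetic. In the sketch below, the component names, load figures, and per-replica capacities are invented for illustration; the point is only the difference between scaling each component independently and scaling whole “boxes” that each contain one of everything.

```python
import math


def desired_replicas(load, capacity_per_replica, min_replicas=1):
    """Replicas needed to serve the load, never dropping below the minimum."""
    return max(min_replicas, math.ceil(load / capacity_per_replica))


def scale_per_component(loads, capacities):
    """Scale each component independently, based only on its own load."""
    return {name: desired_replicas(loads[name], capacities[name]) for name in loads}


def scale_by_box(loads, capacities):
    """Scale in whole boxes: every component gets as many copies as the busiest one needs."""
    boxes = max(scale_per_component(loads, capacities).values())
    return {name: boxes for name in loads}


# Hypothetical components of a data management system and their current load.
LOADS = {"api": 40, "catalog": 120, "data-mover": 950}
CAPACITIES = {"api": 100, "catalog": 100, "data-mover": 100}

if __name__ == "__main__":
    print("per component:", scale_per_component(LOADS, CAPACITIES))
    print("by box:", scale_by_box(LOADS, CAPACITIES))
```

With these made-up numbers, a single bottlenecked component forces the box-based approach to run many more instances in total than per-component scaling would.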
The Rise of DevOps
There is an extremely strong correlation between cloud-native uptake and the adoption of a DevOps software development practice. While DevOps comes with a new set of tools and practices for increased automation, it is a culture shift at its core. DevOps now allows teams to develop, test, deploy, and improve software much faster than ever before.
A part of this shift has involved ceding control over both infrastructure and deployments to the developer and, within the context of this post, it has made it hard for admin-focused VM-based systems to work in DevOps environments. Some of the more concrete shifts seen with cloud-native platforms include:
Application-Oriented Cloud-Native Platforms: The success of Kubernetes has been driven by the fact that it is fundamentally oriented towards supporting developers and their agile application development cycles. While seemingly subtle, this focus permeates the system design of Kubernetes and requires that data management be aligned to the application and not the underlying infrastructure. It no longer makes sense to protect the VMs or servers that Kubernetes runs on. In fact, the focus of data management should be centered on the application as that is ultimately what both the developer and operator really care about!
Developers Owning The Full Stack: With platforms such as Kubernetes, developers can define not just the entire application topology but also infrastructure as code. This has led to increased agility, but also removed the need for legacy, and often manual, change management processes. New data services can be deployed without needing prior operator input or approval. This pattern not only requires flexible yet targeted data management policies but also the ability to put the policy definition and data lifecycle actions in the hands of the developers. However, as one quickly discovers, this is often not found in VM-centric systems that were primarily designed for an operator-centric world.
ITOps Move to Self-Service: Given the combination of application-oriented infrastructure, the application explosion seen as a part of digital transformation, and the embedding of operations skills within developer groups, traditional IT is transitioning to an “ITOps” role that focuses heavily on providing self-service capabilities. While a lot of the self-service was initially focused around infrastructure deployment, data management requires a similar shift where, instead of admin-centric VM-based solutions, scoped control needs to be handed off to developers for the applications they own. To satisfy these self-service requirements, solutions need to support features that admin-centric data management systems were never built for, such as fine-grained RBAC, enhanced security and encryption capabilities, and integration of data management tools into CI/CD pipelines.
Requirement for Cloud-Native APIs + Ecosystem Integration: As we see dev and ops closely co-operating to take on the joint responsibility of data management, the tools they use need to reflect the API-first philosophy adopted in application design. Apart from being protected by RBAC, these APIs need to be cloud-native themselves as, surprisingly, REST alone isn’t adequate in a cloud-native world. For example, with Kubernetes, APIs based on native Custom Resource Definitions (CRDs) will allow for first-class integration with authentication and authorization, will leverage native Kubernetes machinery to simplify operations, allow the use of tools (e.g., kubectl) both developers and operators are already familiar with, and present a single common API surface.
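As a hedged illustration of what a CRD-based API can look like from the user's side, the sketch below builds a custom resource and a matching RBAC Role as plain dictionaries. The `example.com` API group, the `BackupPolicy` kind, and its `spec` fields are hypothetical and are not K10's actual API; only the shape of the `Role` object follows the standard Kubernetes `rbac.authorization.k8s.io/v1` API.

```python
import json


def backup_policy(namespace, app_label, frequency):
    """Build a custom resource for a hypothetical BackupPolicy CRD."""
    return {
        "apiVersion": "example.com/v1alpha1",  # hypothetical API group and version
        "kind": "BackupPolicy",                # hypothetical kind
        "metadata": {"name": f"{app_label}-backup", "namespace": namespace},
        "spec": {                              # hypothetical fields
            "selector": {"matchLabels": {"app": app_label}},
            "frequency": frequency,
        },
    }


def developer_role(namespace):
    """Build a namespaced Role that lets its holders manage the hypothetical policies."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "backup-policy-editor", "namespace": namespace},
        "rules": [
            {
                "apiGroups": ["example.com"],      # matches the hypothetical group above
                "resources": ["backuppolicies"],   # assumed plural name of the CRD
                "verbs": ["get", "list", "create", "update", "delete"],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(backup_policy("shop", "orders", "@daily"), indent=2))
    print(json.dumps(developer_role("shop"), indent=2))
```

Assuming a CRD like this were installed in the cluster, manifests of this form could be applied and listed with kubectl like any built-in resource, and access to them would be governed by the same RBAC rules used for the rest of the cluster.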
Further, it is not just about APIs, but about requiring solutions to interoperate with the rest of the cloud-native infrastructure ecosystem to deliver a far superior UX, reduce management overhead of multiple tools, and provide integration with the cloud-native tools both developers and operators have become accustomed to. Concrete examples today include integration with Prometheus for monitoring and alerting and systems such as fluentd for logging.
CI/CD Pipelines: Finally, with the widespread adoption of CI/CD pipelines into development processes, simply testing stateless components early on in the development cycle is not enough. There is an increased ask for bringing production data into the CI/CD process. This helps with not just functional testing but also with a host of other improvements such as performance analysis, debugging, and dynamic security testing. Achieving this with traditional tools would be costly in both time and money. An application-centric approach with the right access controls would decrease cost and complexity while further hardening the application development process. Additionally, a new application-centric approach can help mirror the shift of breaking down organization boundaries (e.g., dev vs. ops) by breaking down data silos (e.g., backup vs. test/dev copy data management).
The Path Forward
The above sections highlight a number of reasons why VM-centric solutions don’t work well, if at all, for applications deployed on cloud-native platforms such as Kubernetes. Given that we, at Kasten, had the ability to start with a clean slate, a future post will address how we tackled the above technical and organizational shifts to meet our customers’ data management requirements. We believe that our fundamental rethink of how data management should operate for cloud-native applications with an application-first, cloud-native, and software-only approach will be mirrored in many other areas, such as security, monitoring, and networking, as containerized applications evolve.
Niraj Tolia is the General Manager and President of Kasten (acquired by Veeam), which he founded to solve the problem of Kubernetes backup and disaster recovery. With a strong technical background in distributed systems, storage, and data management, he has held multiple leadership roles in the past, including Senior Director of Engineering for Dell EMC's CloudBoost group and VP of Engineering and Chief Architect at Maginatics (acquired by EMC). Dr. Tolia received his PhD, MS, and BS in Computer Engineering from Carnegie Mellon University.