How Flipkart Leveraged OpenEBS for Storage on Kubernetes
Flipkart is India’s leading e-commerce company and one of the world’s fastest-growing large-scale e-commerce companies. Through various acquisitions and product partnerships, the company has grown to become the largest online retailer in the subcontinent, processing hundreds of thousands of transactions a day. Commanding roughly 40% of the Indian online retail market, Flipkart uses the business-to-consumer selling model and generates large amounts of data daily.
Though Flipkart’s core business falls in the retail segment, it has achieved much of its success based on emerging technologies. Since its beginning, it has been a shining example of a successful technology startup in India. While embracing various advanced technologies to support its business functions, the team at Flipkart sought to explore Kubernetes to ensure optimum hardware utilization for its stateful applications.
This article is based on the talk given by Flipkart’s engineers at the KubeCon EU co-located event Data on Kubernetes Day on May 3. They talked about their journey to Kubernetes adoption, the challenges, the lessons learned, and how they leveraged MayaData’s OpenEBS for storage on Kubernetes and other MayaData software to assist in overall management.
All Flipkart services are deployed on a self-managed private cloud spread over two data centers, with an additional data center under construction. Before Kubernetes, the services ran only on virtual machines on bare-metal server infrastructure. The two data centers house slightly over 20,000 bare-metal machines hosting over 75,000 virtual machines. The platform employs the Just a Bunch of Disks (JBOD) storage architecture to store and manage data generated during transactions. JBOD was chosen primarily because it is spacious, rugged and fast enough to accommodate the storage needs of Flipkart’s big data and other applications.
The platform team supports all the business units and products within Flipkart, spanning from the search for users interested in finding products, to billing for charging the consumers, to analytics to assist in product selection and design, and more. At any given time, there are several dozen different stateful workloads in production at Flipkart, including flavors of relational databases, NoSQL databases, logging, machine learning, caching and solutions that tie together these varied data sources such as Kafka and Pulsar. The engineers building data-centric products within Flipkart are central to the rapid innovation and resultant company growth. Their productivity can result in better experiences for shoppers, better decisions made by merchandising and other teams and increased efficiency of operations.
The Flipkart data centers are powered by a topology of SAS-connected JBOD with virtualization and the use of logical volume manager (LVM). As the engineers explained, this topology was selected for ease of scalability, the use of familiar components and the ability to access commodity markets for the underlying components.
Kubernetes for Flipkart’s Stateful Applications
In 2019 and 2020, the technology teams at Flipkart started to investigate the use of Kubernetes for all important workloads, including their dozens of stateful workloads. They are succeeding in their use of Kubernetes for data, and their experience is instructive for many reasons.
The lead engineers anticipated the benefits of orchestrating Flipkart’s stateful applications with Kubernetes would include easier deployment, consistency between deployment and operations environments, increased portability across environments and others. Other factors they cited:
- Kubernetes makes resource allocation between applications seamless, which optimizes hardware utilization and makes better use of ever-scarce data center space and power.
- Kubernetes uses StatefulSets, ReplicaSets and Deployments to ensure fault tolerance, scalability and availability as needed per workload.
- Developer and data scientist productivity is improved by simple declarative intent expressed through storage classes, which abstract away the work necessary to deliver the required storage capabilities. Additionally, because Flipkart has not adopted a shared storage paradigm, each small team and workload is autonomous and need not worry about shared dependencies, such as the larger blast radius of shared storage or the risk that heavily loaded shared storage will degrade unpredictably.
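As an illustration of the declarative intent described in the last point, the following minimal Python sketch builds a storage class and a volume claim as plain dictionaries and prints them as JSON. The provisioner name `local.csi.openebs.io`, the `storage` and `volgroup` parameters, and the names `lvm-local`, `lvmvg` and `tidb-data` are assumptions made for this example; the talk did not show Flipkart’s actual manifests.

```python
import json


def storage_class(name: str, volume_group: str) -> dict:
    """Build a StorageClass manifest for LVM-backed local volumes."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        # Assumed provisioner and parameters for an LVM-based local PV engine.
        "provisioner": "local.csi.openebs.io",
        "parameters": {"storage": "lvm", "volgroup": volume_group},
        # Delay binding until a pod is scheduled, so the volume lands on the pod's node.
        "volumeBindingMode": "WaitForFirstConsumer",
    }


def volume_claim(name: str, storage_class_name: str, size: str) -> dict:
    """Build a PersistentVolumeClaim that requests storage from a storage class."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class_name,
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical names and sizes, chosen only for illustration.
sc = storage_class("lvm-local", "lvmvg")
pvc = volume_claim("tidb-data", sc["metadata"]["name"], "100Gi")
print(json.dumps([sc, pvc], indent=2))
```

The point of the sketch is that a workload team only names a storage class and a size; how the volume is carved out of the underlying disks is left to the platform.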
Challenges of Migration
Flipkart reports that over 20% of its stateful workloads are already running on Kubernetes. Many anticipated benefits are being achieved, including increased density and improved developer and data scientist productivity. There are also unanticipated benefits: because a greater share of the environment is now composed of common open source projects such as Kubernetes and OpenEBS, in addition to LVM, recruiting and training costs are easier to control than with bespoke or proprietary systems. Nonetheless, Flipkart explained, it has faced the following challenges:
- Expertise and onboarding: One of the greatest challenges in Flipkart’s stateful migration was porting the expertise of the teams running the workloads, that is, teaching them how to configure and improve their workload operators and, more generally, how to operate their workloads on the Kubernetes platform.
- Workload migration: Most live workloads required copy-based migration to move the underlying data, which is time-consuming. A lack of buffer capacity either risked downtime for crucial applications or required additional free capacity to be set aside, which slowed the onboarding of other workloads that would otherwise have used that capacity.
- Backup tools: A migration to Kubernetes also requires extensive reconfiguration of existing backup tools to fit a containerized setup.
- Variability: Many workloads were settled into their particular environments and, in a sense, knew what performance to expect. Migration introduced variability into the performance received and highlighted the need for increased characterization of workloads and the extension of the underlying OpenEBS to use quality-of-service (QoS)-based scheduling.
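To make the cost of copy-based migration concrete, here is a rough back-of-the-envelope sketch in Python. It assumes that the source and destination copies coexist until cutover, so the buffer capacity held equals the data being copied. The function name and all figures (10 TiB per workload, 500 MiB/s sustained copy throughput, three workloads in flight) are hypothetical and are not numbers reported by Flipkart.

```python
def copy_migration_estimate(data_tib: float, throughput_mib_s: float, concurrent: int = 1):
    """Return (hours to copy one workload, TiB of buffer capacity held during the copy).

    Assumes each workload is copied at a steady throughput and that its source
    and destination copies both exist until cutover.
    """
    data_mib = data_tib * 1024 * 1024          # TiB -> MiB
    hours = data_mib / throughput_mib_s / 3600  # seconds -> hours
    buffer_tib = data_tib * concurrent          # one extra copy per in-flight workload
    return hours, buffer_tib


# Hypothetical inputs, for illustration only.
hours, buffer_tib = copy_migration_estimate(data_tib=10, throughput_mib_s=500, concurrent=3)
print(f"{hours:.1f} h per workload, {buffer_tib:.0f} TiB of buffer capacity")
```

Even with these modest assumed figures, each copy takes several hours and ties up capacity that other workloads cannot use in the meantime, which is the trade-off the engineers described.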
Flipkart’s workload migration comprised three phases:
- Local persistent volumes: To maintain data path performance, local persistent volumes (PVs) are used for persistence. Besides maintaining the data path status quo, the use of local persistent volumes also ensures that the team’s large investment in a deep understanding of Linux LVM is leveraged to the fullest extent.
- Container attached storage (CAS): By enabling developers to treat storage entities as microservices, the use of a CAS phase allowed for a clean separation between persistent volumes and the underlying storage hardware. This makes stateful workloads faster and more portable, simplifying migration.
- Container attached storage deployed as semi-shared storage: This phase allows for the complete decoupling of storage management software from the underlying hardware. This enables the automatic scaling of storage capacity without having to add/remove hardware. In this architecture, replication for durable volumes, efficient data encryption and compression are also enabled.
OpenEBS Partnership for CSI-Compliant Local PV for CAS and Semi-Shared CAS Storage
Flipkart selected OpenEBS, developed by MayaData, to help migrate data and stateful workloads to Kubernetes and to improve the ease of use and operational efficiency of these workloads. Flipkart also worked with MayaData to add features to OpenEBS, including improved multitenancy and multipool support, enhanced OpenEBS LocalPV support for HDD devices, and pod scheduling that is aware of storage capacity and QoS.
With OpenEBS as the CSI-connected container attached storage, Flipkart got access to flexible, fully scalable mounted storage that supported stateful workloads.
Flipkart began its stateful migration journey in 2020, seeking the benefits mentioned above, including improved hardware utilization. At the time of this writing, the team and the end users from different business units and applications had moved around 20% of the stateful stack to Kubernetes, with the aim of moving the complete stack by September.
Some of the first workloads that are now running on Kubernetes at Flipkart include:
- TiDB (Titanium Database): TiDB combines the ACID guarantees of SQL databases with flexible scalability, high availability and strong consistency. Deploying TiDB in Kubernetes allows for elastic storage resource configuration, declarative deployment and automated management. The database adapts to the Kubernetes ecosystem to allow fault tolerance through replication. Kubernetes also simplifies the management of TiDB clusters through autoscaling and failovers.
- Apache Pulsar: A cloud native messaging service used for consolidated messaging and streaming. Designed as a message-as-a-service offering, Pulsar offers low-latency access to local storage (managed by OpenEBS in this case), horizontal scalability, load balancing and multitenant support, among other features. Flipkart is using Kubernetes’ built-in scaling and replication features to build a scalable messaging system on Apache Pulsar.
Additional workloads that are in the process of being deployed on Kubernetes with OpenEBS storage at Flipkart include MongoDB, MySQL, HBase, Aerospike, Redis and Memcache. The vision is to give the business units and engineers building applications freedom of choice, so they can select the right databases for their particular needs.
Some of the key lessons the Kubernetes platform team at Flipkart has learned while working on the migration include:
- Being production-ready: Most datastore operators assume that the database will be deployed only on local clusters and for this and other reasons are not production-ready.
- Managing storage resources: While OpenEBS LocalPV with LVM can automate the creation of underlying pools and thereby improve usage, there remains a risk of fragmentation of local disks, in part due to highly variable workload demands. This can lead to waste of storage resources and put a premium on improved capacity management and capacity-based scheduling.
- Creating a volume group construct: Before recent advancements to the OpenEBS LocalPV LVM engine, there was no CSI-manageable LVM option that included a volume group construct.
- LVM partition: Backup applications cannot access Kubernetes snapshots, so the only point-in-time copy available is the volume clone; this raises some challenges when using LVM.
- Disk failure response: Without the use of local RAID, which uses scarce disk space, datastore operators must be able to respond to relatively frequent disk failures when systems are run at the scale of Flipkart.
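The fragmentation concern raised under managing storage resources can be illustrated with a small Python sketch of capacity-based placement. It uses a simple best-fit rule: among the nodes with enough free space in their volume group, pick the one that would be left with the least spare capacity, so that large free regions stay available for large requests. This is an illustrative heuristic under assumed inputs, not the scheduling logic OpenEBS or Flipkart actually uses; the function name, node names and capacities are hypothetical.

```python
from typing import Optional


def pick_node(free_gib_by_node: dict[str, int], request_gib: int) -> Optional[str]:
    """Best-fit placement: choose the node whose free capacity leaves the smallest remainder."""
    # Keep only nodes that can hold the requested volume at all.
    candidates = {node: free for node, free in free_gib_by_node.items() if free >= request_gib}
    if not candidates:
        return None  # no node can satisfy the request
    # Smallest leftover wins; the node name breaks ties deterministically.
    return min(candidates, key=lambda node: (candidates[node] - request_gib, node))


# Hypothetical free capacity per node's volume group, in GiB.
free = {"node-a": 500, "node-b": 120, "node-c": 90}
chosen = pick_node(free, 100)
print(chosen)
```

With these inputs the 100 GiB request goes to node-b rather than node-a, which preserves node-a’s 500 GiB for a workload that needs a large volume.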
By migrating its stateful workloads to Kubernetes, Flipkart aims to take advantage of seamless autoscaling, low latency, and the flexibility of running applications on containers as well as other benefits articulated above. During the ongoing adoption of Kubernetes, it has addressed various challenges that come with orchestrating stateful workloads.
The partnership with MayaData and the use and enhancement of OpenEBS to deliver LVM-based container attached storage on local and remote nodes has helped Flipkart to accelerate its adoption of Kubernetes for data while securing many of the desired benefits.