
How to Move Past Elasticsearch’s Scaling Challenges

23 Mar 2020 11:18am, by Ryan Staatz

LogDNA sponsored this post.

Ryan Staatz
Ryan Staatz is the head of DevOps at LogDNA, where he migrated the company’s infrastructure from VMs to Kubernetes. His team partners with large enterprise companies, such as IBM, to establish stability across deployments, expand LogDNA’s compliance repertoire and improve observability at scale. Ryan presents frequently on scaling Elasticsearch on Kubernetes, handling challenges with a multicloud infrastructure, running Kubernetes on bare metal and managing dozens of separate production environments. Prior to LogDNA, Ryan worked at companies like WhatsApp, and he remains passionate about contributing at startups. Ryan holds a BS in human biology from Stanford University.

Data search and analytics are an essential component of modern applications.


Enterprises need the ability to store and query data collected from users, applications, infrastructure and other sources in real-time. However, the data generated by these sources can reach significant volumes — especially as the enterprise grows.

Elasticsearch, an open source search and analytics engine built on Apache Lucene, has served as a popular solution to this data problem. Its popularity is due, in part, to its ability to ingest unstructured data and to its resiliency in the event of failure through clustering. As part of the open source Elastic Stack, it has become the default data analytics tool for many enterprises, including LogDNA. However, Elasticsearch becomes harder to operate as a cluster grows. With enterprises generating and consuming ever-increasing amounts of data, these scaling limits can quickly become barriers to growth.

At LogDNA, we developed a solution for scaling Elasticsearch to petabyte scale. Our solution leverages Kubernetes to automate the deployment, scaling and maintenance of Elasticsearch nodes across many cloud and on-premises platforms.

In this post, we will outline this approach and provide detailed information about the configurations, techniques and optimizations we used to achieve our level of scale.

Why K8s

Kubernetes has become the world’s leading container orchestration platform. It allows teams to automate workload scheduling, deployments and scaling. LogDNA uses Kubernetes extensively across many different platforms. We utilize all of the features of Kubernetes and a number of available tools to ensure that our environments are as replicable as possible. To manage the large datasets we handle daily and to provide a swift search option for our customers, we also run Elasticsearch. As fans of both Elasticsearch and Kubernetes, we wanted to combine both solutions by packaging and deploying Elasticsearch as a containerized application.

However, running Elasticsearch on Kubernetes is not easy. Running Elasticsearch at scale is difficult on any platform: there is a tipping point where incoming data oversaturates the nodes, and Ops teams must carefully manage an overwhelmingly large number of indices and shards without fragmenting the data. On Kubernetes, Elasticsearch requires additional configuration and management because containers are transient and the cluster needs persistent, separate volumes to be stateful, that is, to retain its data and history from restart to restart.

If you can get Elasticsearch up and running on Kubernetes successfully, you gain several benefits. Kubernetes’ native ability to scale workloads and distribute loads over hardware enables you to manage much larger datasets swiftly and efficiently. In addition, you can take advantage of version control and code review to manage your configurations, ensuring easily reproducible environments. Finally, the maintenance of such large clusters is much easier when you can use the tooling that is available for Kubernetes itself.

How to Deploy

There are many considerations you need to take into account when you run Elasticsearch on Kubernetes. These range from understanding the hardware requirements and the difference between local and networked storage to figuring out your fault tolerances and multisite data access setup.

When it comes to hardware, you really have two options: virtual machines or bare metal. Virtual machines allow you more flexibility and lower maintenance costs, but they’re expensive. Bare metal servers, on the other hand, are much cheaper but require a lot more maintenance on your part and can be hard to deploy in a hurry. In both cases, you can run Kubernetes yourself. If you choose virtual machines, you also can get a managed Kubernetes platform.

The other big difference between the two hardware options relates to storage. Virtual machines generally require networked storage. Networked storage often “just works” on hosting providers with Kubernetes, which can be a big draw when you might be overwhelmed by the maintenance of a system. In addition, scheduling is a lot easier because the storage is not tied to a specific node: should a node fall over and restart, the storage is not affected. However, the performance of networked storage is limited by the speed and availability of the network it is connected to. Because of its network connectivity needs, and because it is generally managed by a provider, networked storage also costs more than local storage.

Bare metal options, on the other hand, often involve local storage. Local storage is not a plug-and-play option for Kubernetes most of the time. It requires management to get that storage up and running for a Kubernetes cluster. In addition, local storage is, by definition, connected to a specific node. If that node fails, the storage does, as well.

On the other hand, local storage is a lot faster because it does not depend on a network’s connectivity speeds. Without network infrastructure or a provider-managed service to pay for, it is also a lot cheaper. We chose local storage for performance reasons, and we decided to work with a vendor that specializes in dynamic local storage for Kubernetes clusters on bare metal. Using a vendor also sidesteps the issues with basic Kubernetes local storage, like the lack of encryption, and with a DIY solution, like the difficulty of getting disks provisioned quickly.
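To make the node-bound nature of local storage concrete, here is a minimal sketch of the basic Kubernetes local volume objects that a vendor solution would replace. It is written in Python and emits JSON, which `kubectl apply -f` accepts alongside YAML. The storage class name, node name, disk path and size are hypothetical; this is not LogDNA's setup, which the article says relies on a vendor for dynamic provisioning.

```python
import json

# Hypothetical storage class name, for illustration only.
STORAGE_CLASS_NAME = "local-ssd"


def local_storage_class(name=STORAGE_CLASS_NAME):
    """StorageClass for statically provisioned local volumes."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        # Built-in local volumes have no dynamic provisioner.
        "provisioner": "kubernetes.io/no-provisioner",
        # Delay binding until a pod is scheduled, so the scheduler can
        # take the volume's node into account.
        "volumeBindingMode": "WaitForFirstConsumer",
    }


def local_persistent_volume(node, path, size, storage_class=STORAGE_CLASS_NAME):
    """PersistentVolume backed by a disk on one specific node."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": f"es-data-{node}"},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteOnce"],
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": storage_class,
            "local": {"path": path},
            # The volume is only usable from the node that owns the disk.
            # If that node fails, the volume is unavailable with it.
            "nodeAffinity": {
                "required": {
                    "nodeSelectorTerms": [{
                        "matchExpressions": [{
                            "key": "kubernetes.io/hostname",
                            "operator": "In",
                            "values": [node],
                        }]
                    }]
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical node name and mount path.
    objects = [
        local_storage_class(),
        local_persistent_volume("node-1", "/mnt/disks/ssd0", "1Ti"),
    ]
    print(json.dumps(objects, indent=2))
```

Every disk on every node needs its own PersistentVolume in this static model, which is the provisioning burden that a dynamic local storage vendor is meant to remove.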

How to Manage

Kubernetes objects are typically defined in YAML files. At LogDNA, we use ConfigMaps to manage cluster-wide configuration settings and StatefulSets to define the Elasticsearch pods. Our ConfigMaps are used to bootstrap each pod and configure all of the necessary variables that ensure our clusters stay up and running. The StatefulSets, on the other hand, ensure that our Elasticsearch pods continue to function in the master/node relationship that Elasticsearch requires. We actually run three separate StatefulSets: one for master nodes, one for hot nodes and one for cold nodes. The hot nodes handle active indices, and the cold nodes handle older, less resource-intensive indices.
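As a rough illustration of the three-StatefulSet layout, the sketch below builds one manifest per tier from a small table. The tier names, replica counts, storage sizes, image reference and the `box_type` node attribute are all assumptions made for the example, and the `node.master` and `node.data` settings assume an Elasticsearch 7.x-style configuration passed through environment variables.

```python
import json

# Hypothetical tiers. Names, replica counts, sizes and the "box_type"
# attribute are illustrative assumptions, not LogDNA's actual settings.
TIERS = {
    "es-master": {"replicas": 3, "master": True, "data": False,
                  "box_type": None, "storage": "10Gi"},
    "es-hot": {"replicas": 6, "master": False, "data": True,
               "box_type": "hot", "storage": "1Ti"},
    "es-cold": {"replicas": 4, "master": False, "data": True,
                "box_type": "cold", "storage": "4Ti"},
}

IMAGE = "registry.example.com/elasticsearch:7.x"  # placeholder image reference


def build_statefulset(name, tier):
    """Return a StatefulSet manifest (as a dict) for one Elasticsearch tier."""
    env = [
        {"name": "node.master", "value": str(tier["master"]).lower()},
        {"name": "node.data", "value": str(tier["data"]).lower()},
    ]
    if tier["box_type"]:
        # Custom node attribute that index allocation rules can match on.
        env.append({"name": "node.attr.box_type", "value": tier["box_type"]})
    labels = {"app": "elasticsearch", "tier": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "serviceName": name,
            "replicas": tier["replicas"],
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "elasticsearch",
                        "image": IMAGE,
                        "env": env,
                        "volumeMounts": [{
                            "name": "data",
                            "mountPath": "/usr/share/elasticsearch/data",
                        }],
                    }],
                },
            },
            # Each pod gets its own persistent volume claim.
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": tier["storage"]}},
                },
            }],
        },
    }


if __name__ == "__main__":
    manifests = [build_statefulset(name, tier) for name, tier in TIERS.items()]
    print(json.dumps(manifests, indent=2))
```

Keeping the tiers in separate StatefulSets means each one can be scaled, sized and updated on its own schedule.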

When planning out how you will manage your systems, templates for your ConfigMaps will be your most useful tool, especially if you have multiple environments to consider. The templates for our systems use standard Go templating syntax to populate the various fields to match each environment or customized setting. Templating also lets us use semantic versioning, which ensures everyone is on the same page when we run deployments or otherwise manage our clusters, and version control, which allows us to manage our configuration just as we manage our deployments.
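The idea of one versioned template rendered per environment can be sketched as follows. The article's templates use Go templating syntax; this example swaps in Python's `string.Template` as a stand-in to show the same pattern. The environment names, field values and the `config-version` label are hypothetical.

```python
from string import Template

# Semantic version of the template itself (hypothetical value).
TEMPLATE_VERSION = "1.4.0"

CONFIGMAP_TEMPLATE = Template("""\
apiVersion: v1
kind: ConfigMap
metadata:
  name: elasticsearch-config
  labels:
    config-version: "$version"
data:
  elasticsearch.yml: |
    cluster.name: $cluster_name
    node.attr.zone: $zone
""")

# Hypothetical per-environment values.
ENVIRONMENTS = {
    "staging": {"cluster_name": "es-staging", "zone": "zone-a"},
    "production": {"cluster_name": "es-production", "zone": "zone-b"},
}


def render_configmap(environment):
    """Render the ConfigMap YAML for one environment.

    substitute() raises KeyError if a field is missing, so an incomplete
    environment definition fails loudly instead of rendering a bad config.
    """
    values = ENVIRONMENTS[environment]
    return CONFIGMAP_TEMPLATE.substitute(version=TEMPLATE_VERSION, **values)


if __name__ == "__main__":
    print(render_configmap("staging"))
```

Because the template and the per-environment values live in plain files, both can go through the same code review and version control as application code.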

To increase the resilience of your Elasticsearch system on Kubernetes, you should consider your fault tolerance needs and the balance between data residency and data access. Our StatefulSets help us manage these considerations across environments. We use anti-affinity rules to balance our pod distribution across datacenters and affinity zones with the various hosting providers where we run, which increases our fault tolerance and availability in each of those providers’ networks.
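One way anti-affinity rules of this kind might look is sketched below: a hard rule keeps two Elasticsearch pods of the same tier off the same host, and a soft rule prefers spreading them across zones. The label names and weight are assumptions. `kubernetes.io/hostname` and `topology.kubernetes.io/zone` are standard Kubernetes node labels, though whether the zone label is populated depends on the provider.

```python
import json


def pod_anti_affinity(tier_label):
    """Build an affinity block for a pod template.

    tier_label is the hypothetical value of a "tier" label carried by
    every pod in the same StatefulSet.
    """
    selector = {"matchLabels": {"app": "elasticsearch", "tier": tier_label}}
    return {
        "podAntiAffinity": {
            # Hard rule: never co-locate two pods of this tier on one host.
            "requiredDuringSchedulingIgnoredDuringExecution": [{
                "labelSelector": selector,
                "topologyKey": "kubernetes.io/hostname",
            }],
            # Soft rule: prefer placing pods in different zones.
            "preferredDuringSchedulingIgnoredDuringExecution": [{
                "weight": 100,
                "podAffinityTerm": {
                    "labelSelector": selector,
                    "topologyKey": "topology.kubernetes.io/zone",
                },
            }],
        }
    }


if __name__ == "__main__":
    # This block would sit under spec.template.spec.affinity in a StatefulSet.
    print(json.dumps(pod_anti_affinity("es-hot"), indent=2))
```

Making the zone rule a preference rather than a requirement lets a tier keep scheduling pods when it has more replicas than there are zones.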

Elasticsearch on Kubernetes, in particular, needs a highly available system because of how it integrates with storage and Kubernetes services. Our Elasticsearch StatefulSets are tuned to ensure that Elasticsearch pods get priority on servers, which is crucial since Elasticsearch is very resource-intensive. Finally, we ensure that there’s always an Elasticsearch pod running by defaulting to a RollingUpdate strategy, which reduces the impact of any new deployments on our infrastructure and our customers.
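The article does not say how pod priority is configured. One plausible reading is a Kubernetes PriorityClass referenced from the pod template, combined with the StatefulSet `updateStrategy` field. The sketch below shows that combination; the class name and priority value are hypothetical.

```python
import json

# Hypothetical name and value for the priority class.
PRIORITY_CLASS_NAME = "elasticsearch-critical"


def priority_class(name=PRIORITY_CLASS_NAME, value=1000000):
    """PriorityClass that lets Elasticsearch pods preempt lower-priority pods."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": "Priority for Elasticsearch pods (illustrative).",
    }


def statefulset_overrides(priority_class_name=PRIORITY_CLASS_NAME):
    """Fragment to merge into a StatefulSet spec."""
    return {
        # Pods are replaced one at a time, so the rest of the set keeps
        # serving while an update rolls out.
        "updateStrategy": {
            "type": "RollingUpdate",
            # partition 0 means every pod ordinal is eligible for update.
            "rollingUpdate": {"partition": 0},
        },
        "template": {
            "spec": {"priorityClassName": priority_class_name},
        },
    }


if __name__ == "__main__":
    print(json.dumps([priority_class(), statefulset_overrides()], indent=2))
```

Raising the `partition` value would hold back pods with lower ordinals, which is one way to stage an upgrade on part of a tier before rolling it out fully.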

In the future, we expect deploying Elasticsearch on Kubernetes will become even easier. Recently, Elastic released Elastic Cloud on Kubernetes (ECK) containing the entire Elastic Stack. ECK is built on Kubernetes Operators, which can be used to automate the maintenance of custom resources and deployments. More recent versions of Elasticsearch and Kubernetes also offer improved support for federated search and preemption. The speed at which both projects are updated makes evaluating, testing and performing upgrades a continuous process.

To learn more about how we scale Elasticsearch on Kubernetes, check out the corresponding webinar presented by LogDNA.

The Cloud Native Computing Foundation is a sponsor of The New Stack.

