ELK and the Role of Open Source in the Enterprise
I’m always curious about what triggers open source adoption in the enterprise. We all have ideas, and I hear many views from customers and users of open source in daily discussions. As an analyst, I always saw it as an organic uptake of the technology by an engineer, which then continues to expand. We often say that “engineers want open source,” but it’s more accurate to say that engineers prefer using open source to solve the problems they experience.
As an engineer myself, I found that solving a specific issue was easier when using the most accessible tools, those that can be easily downloaded from the community. For example, people who work with systems often end up using archaic editors like “vi” (which we grow to love) because they are readily available. The proliferation of DevOps and the decentralized manner in which teams operate today often give individual team members a lot of autonomy over technology choices. This recent 451 Research survey data shows the battle between a centralized and a decentralized approach:
There are loads of tools and technologies these teams use. Some tools in use are standardized, but many teams will use the tools they want to in order to solve the problems at hand. Enterprise standards are created and designed to avoid problems down the line, but they also advance slowly, often not keeping pace with the level of change in the application or infrastructure stack. Many teams will leverage and stand up open source tools for monitoring, automation, and other components to solve problems. While automation is a whole topic unto itself, monitoring tools are popular choices for open source as this 451 Research survey shows.
The most popular open source tools include the ELK search stack, Grafana, Prometheus, Jaeger, and many other tools which are part of the observability ecosystem.
These independent instances create a lot of challenges and waste in the typical organization. This is inefficient, but teams must progress with or without the centralized IT groups. The problem grows when there is a security issue, an acquisition, a departure of talent, or a need to allocate effort towards building new things. The quality of the open source stack can then suffer, which results in major problems.
This is most commonly seen with the ELK stack, which has been in use since around 2012 and was rapidly adopted to reduce the use of expensive proprietary software for log aggregation. ElasticSearch has become a popular database and backend to index all kinds of content, including log files. It’s understandable that when Jaeger was open sourced, one of the first things added was ElasticSearch support.
This allowed users to run a single database for logs, traces, and other search requirements. Soon after, ElasticSearch was the most popular backend used for Jaeger. However, the baffling licensing change that Elastic has made to increase revenues will cause the community to move towards a new Apache 2.0 licensed fork. The users of these open source tools will not stand for restrictive, non-open-source licenses, especially the SSPL.
The balance between indexing data and building more clever ways to recall it has been constantly shifting in big data platforms. For example, ElasticSearch provides rapid and broad data search and recall, but some queries could be faster with more targeted approaches, such as those used by technologies like Loki. The new fork of ElasticSearch will focus on optimizing the cost to operate, which means lower memory consumption and lower storage consumption. This focus will make the technology more attractive to those building on top of it as well as those operating it.
The other drawback is that ElasticSearch is tightly coupled to Kibana. This means that you must upgrade both together, and there are breaking changes between versions, creating many challenges for those deploying and maintaining these great technologies. The new fork of Kibana will fix this, giving engineers more freedom to use the tools they want to use.
We use ElasticSearch as a core technology to deliver Logz.io, and we’ve been doing so for several years. Many of our users were trying to run or scale ELK, Grafana, and Jaeger internally, but they often ran into complex issues. After all, detecting and troubleshooting problems is a challenge even for organizations with outstanding engineering talent. It’s critical that a monitoring system has higher uptime than the services and infrastructure it’s monitoring. If the monitoring system is unavailable at the same time as your system, troubleshooting becomes difficult if not impossible. The other challenge is that when the systems being monitored have problems or are under attack, they often send large amounts of data, especially logs. This in turn can tax observability systems that cannot adapt to the demand, often resulting in outages of the observability system itself.
Let’s have a look at some concerns we often hear from users who are managing ElasticSearch themselves. These challenges are also relevant to many of the cloud services for ElasticSearch since they often require that you manage aspects of the cluster, data management, or other technologies that are part of the stack:
- The first step is getting data into the observability systems. This can be an enormous challenge with ElasticSearch, since it must map the data into a supported format and schema within the backend. If fields do not match, data will be discarded or the cluster may perform poorly. Many organizations are missing most of their data because of this problem. We’ve taken a unique approach to re-ingesting and reformatting data in our Kafka layers, but this is all custom-developed software to deal with this issue.
- Maintenance tasks include keeping systems up to date, upgrading Kibana and ElasticSearch along with Kafka and other components. Upgrades are critical for security patches and for improvements in the software. Often, you’ll want a staging or test system to ensure upgrades work. This takes more time away from other tasks.
- Data balancing tasks include migrating indexes from faster to slower nodes for archiving purposes or to save costs in the cluster by moving data to spinning disk or other lower-cost storage nodes.
- Common performance issues are first identified by implementing metric collection (most often with Prometheus). Then, the monitoring of ElasticSearch and Kafka will give you insights into these problems. Logz.io Infrastructure Monitoring is a superb choice to collect this data too!
- The first sign of performance issues is when Kibana gets sluggish. This normally happens because the cluster does not have the right ElasticSearch nodes for query or data access. It can also indicate a bottleneck in disk IO. These hardware and cluster changes are often complex to isolate and fix.
- The cluster may be in good shape, but the data is not being striped the right way. Adjusting sharding and replication is often something you must do even with SaaS services, and getting it wrong can cause performance issues or query problems.
- In the Kafka layers, queues can begin building up, showing delays in writes to ElasticSearch. Sometimes data may be dropped because of resource constraints, or there may be long delays before the data becomes accessible.
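To make the first concern in this list more concrete, here is a minimal Python sketch of a pre-ingestion check that separates documents whose field types match an expected schema from those that would need repair. It is only an illustration under stated assumptions: `EXPECTED_TYPES`, `find_type_conflicts`, and `split_batch` are hypothetical names, the schema is invented, and the strict type comparison does not reproduce ElasticSearch’s own mapping or coercion rules. It is not a description of how any particular ingestion pipeline is built.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema for illustration; a real deployment would derive
# this from the index mapping it actually uses.
EXPECTED_TYPES: Dict[str, type] = {
    "timestamp": str,
    "status": int,
    "message": str,
}


def find_type_conflicts(doc: Dict[str, Any], expected: Dict[str, type]) -> List[str]:
    """Return the names of fields whose value type differs from the expected type."""
    conflicts = []
    for field, value in doc.items():
        want = expected.get(field)
        if want is None:
            # Fields outside the expected schema are ignored in this sketch.
            continue
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        if not isinstance(value, want) or (want is int and isinstance(value, bool)):
            conflicts.append(field)
    return conflicts


def split_batch(
    docs: List[Dict[str, Any]], expected: Dict[str, type]
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Separate documents that match the schema from those that need repair."""
    clean, needs_repair = [], []
    for doc in docs:
        if find_type_conflicts(doc, expected):
            needs_repair.append(doc)
        else:
            clean.append(doc)
    return clean, needs_repair


if __name__ == "__main__":
    batch = [
        {"timestamp": "2021-01-01T00:00:00Z", "status": 200, "message": "ok"},
        {"timestamp": "2021-01-01T00:00:01Z", "status": "error", "message": "failed"},
    ]
    clean, needs_repair = split_batch(batch, EXPECTED_TYPES)
    print(len(clean), "clean;", len(needs_repair), "need repair")
```

The point of routing mismatched documents to a separate list, rather than dropping them, is that they can be reformatted and retried later instead of silently going missing.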
As you can tell, these types of performance and maintenance issues can be time-consuming and difficult to fix. Running multiple monitoring systems to monitor each other is a difficult task for teams to take on, which is why the use of SaaS services is growing in popularity. Still, doing it yourself with open source is the most common way to solve the need for observability and monitoring.
We hope that the new fork and our combined community brain will correct and improve many of these challenges. Amazon Web Services (AWS) has done a lot of this work in OpenDistro already, but having a better foundation to build on top of will help immensely.
Similarly, there are other issues that require troubleshooting on open source Grafana, Jaeger, and other components in use, especially the need for maintenance tasks. These can take engineers and developers away from tasks that add value to the business. Over time, the management of open source monitoring can be a major distraction and resource drain.
At Logz.io, we make sure that users don’t have to deal with these issues. We manage all of the overhead for our customers, from the free users to our largest customers. We do this by running a multitenant system, which avoids the scale challenges that come from data bursts.
Building and running this system is difficult. We learn daily and provide a high-quality service to our customers. This is always our focus and our goal and we are continually improving the open source tools we use as well as our service at the same time. We expect open source to keep getting better and easier, but when dealing with scale, it’s always a learning process. We will keep learning with you and provide the best open source solution possible to the market on any cloud in any region.
Feature image via Pixabay.