Boost DevOps Maturity with a Data Lakehouse
In a world riven by macroeconomic uncertainty, businesses increasingly turn to data-driven decision-making to stay agile.
That’s especially true of the DevOps teams tasked with driving digital-fueled sustainable growth. They’re unleashing the power of cloud-based analytics on large data sets to unlock the insights they and the business need to make smarter decisions. From a technical perspective, however, that’s challenging. Observability and security data volumes are growing all the time, making it harder to orchestrate, process, analyze and turn information into insight. Cost and capacity constraints are becoming a significant burden to overcome.
Data Scale and Silos Present Challenges
DevOps teams are often thwarted in their efforts to drive better data-driven decisions with observability and security data. That’s because of the heterogeneity of the data their environments generate and the limitations of the systems they rely on to analyze this information.
Most organizations are battling cloud complexity. Research has found that 99% of organizations have embraced a multicloud architecture. On top of these cloud platforms, they’re using an array of observability and security tools to deliver insight and control — seven on average. This results in siloed data that is stored in different formats, adding further complexity.
This challenge is exacerbated by the high cardinality of data generated by cloud native, Kubernetes-based apps. The sheer number of permutations can break traditional databases.
Many teams look to huge cloud-based data lakes, a repository that stores data in its natural or raw format, to help teams centralize disparate data. A data lake enables teams to keep as much raw, “dumb” data as they wish, at relatively low cost, until teams in the business find a use for it.
When it comes to extracting insight, however, data needs to be transferred to a warehouse technology so it can be aggregated and prepared before it is analyzed. Various teams usually then end up transferring the data again to another warehouse platform, so they can run queries related to their specific business requirements.
When Data Storage Strategies Become Problematic
Data warehouse-based approaches add cost and time to analytics projects.
As many as tens of thousands of tables may need to be manually defined to prepare data for querying. There’s also the multitude of indexes and schemas needed to retrieve and structure the data and define the queries that will be asked of it. That’s a lot of effort.
Any user who wants to ask a new question for the first time will need to start from scratch to redefine all those tables and build new indexes and schemas, which creates a lot of manual effort. This can add hours or days to the process of querying data, meaning insights are at risk of being stale or are of limited value by the time they’re surfaced.
The more cloud platforms, data warehouses and data lakes an organization maintains to support cloud operations and analytics, the more money they will need to spend. In fact, the storage space required for the indexes used to support data retrieval and analysis may end up costing more than the data storage itself.
Further costs will arise if teams need technologies to track where their data is and to monitor data handling for compliance purposes. Frequently moving data from place to place may also create inconsistencies and formatting issues, which could affect the value and accuracy of any resulting analysis.
Combining Data Lakes and Data Warehouses
A data lakehouse approach combines the capabilities of a warehouse and a lake to solve the challenges associated with each architecture, thanks to its enormous scalability and massively parallel processing capabilities. With a data lakehouse approach to data retention, organizations can cope with high-cardinality data in a time- and cost-effective manner, maintaining full granularity and extra-long data retention to support instant, precise and contextual predictive analytics.
But to realize this vision, a data lakehouse must be schemaless, indexless and lossless. Being schema-free means users don’t need to predetermine the questions they want to ask of data, so new queries can be raised instantly as the business need arises.
Indexless means teams have rapid access to data without the storage cost and resources needed to maintain massive indexes. And lossless means technical and business teams can query the data with its full context in place, such as interdependencies between cloud-based entities, to surface more precise answers to questions.
Unifying Observability Data
Let’s consider the key types of observability data that any lakehouse must be capable of ingesting to support the analytics needs of a modern digital business.
- Logs are the highest volume and often most detailed data that organizations capture for analytics projects or querying. Logs provide vital insights to verify new code deployments for quality and security, identify the root causes of performance issues in infrastructure and applications, investigate malicious activity such as a cyberattack and support various ways of optimizing digital services.
- Metrics are the quantitative measurements of application performance or user experience that are calculated or aggregated over time to feed into observability-driven analytics. The challenge is that aggregating metrics in traditional data warehouse environments can create a loss of fidelity and make it more difficult for analysts to understand the relevance of data. There’s also a potential scalability challenge with metrics in the context of microservices architectures. As digital services environments become increasingly distributed and are broken into smaller pieces, the sheer scale and volume of the relationships among data from different sources is too much for traditional metrics databases to capture. Only a data lakehouse can handle such high-cardinality data without losing fidelity.
- Traces are the data source that reveals the end-to-end path a transaction takes across applications, services and infrastructure. With access to the traces across all services in their hybrid and multicloud technology stack, developers can better understand the dependencies they contain and more effectively debug applications in production. Cloud native architectures built on Kubernetes, however, greatly increase the length of traces and the number of spans they contain, as there are more hops and additional tiers such as service meshes to consider. A data lakehouse can be architected such that teams can better track these lengthy, distributed traces without losing data fidelity or context.
There are many other sources of data beyond metrics, logs, and traces that can provide additional insight and context to make analytics more precise. For example, organizations can derive dependencies and application topology from logs and traces.
If DevOps teams can build a real-time topology map of their digital services environment and feed this data into a lakehouse alongside metrics, logs and traces, it can provide critical context about the dynamic relationships between application components across all tiers. This provides centralized situational awareness that enables DevOps teams to raise queries about the way their multicloud environments work so they can understand how to optimize them more effectively.
User session data can also be used to gain a better understanding of how customers interact with application interfaces so teams can identify where optimization could help.
As digital services environments become more complex and data volumes explode, observability is certainly becoming more challenging. However, it’s also never been more critical. With a data lakehouse-based approach, DevOps teams can finally turn petabytes of high-fidelity data into actionable intelligence without breaking the bank or becoming burnt out in the effort.