
5 Ways to Make the Most of Your Data ‘Lakehouse’

18 Feb 2021 11:12am, by Chris D’Agostino

The data ecosystem has evolved significantly over the last few decades, from data warehouses in the 1980s to enterprise data lakes in the early 2000s to today's Lakehouse concept, which combines the best of both worlds. As described in a recent Databricks blog entry, the Lakehouse is “a new, open architecture that combines the best elements of data lakes and data warehouses … They are what you would get if you had to redesign data warehouses in the modern world, now that cheap and highly reliable storage (in the form of object stores) are available.”

Here are five ways to successfully leverage Lakehouses and enable data and artificial intelligence (AI) at scale to transform your organization.

Establish Implementation Goals and Business Value

Chris D’Agostino
Chris D’Agostino joined Databricks in January 2020 as the Global Principal Technologist. In this role, Chris provides thought leadership and guidance for the C-level executives at our major customers around the world. Chris is an experienced engineering executive and entrepreneur with a successful track record in both the financial services industry and the U.S. Intelligence Community — in companies ranging from start-up phase to Fortune 100.

When starting an AI-driven digital transformation journey, most organizations establish an overarching goal of becoming a machine learning-first company. This is a massive objective that requires executive buy-in and support, often at the CEO and Board of Directors level.

Once you receive sign-off from these stakeholders, you can begin the transformation by implementing a modern cloud data architecture to future-proof your investment and go all-in on the cloud. Previously, organizations focused on a primary cloud vendor, but given current constraints, regulatory compliance requirements, and the competitive landscape, more companies are moving toward a multicloud solution in order to distribute and run workloads on the cloud environments of their choice.

Focusing on the technological transformation isn’t enough to solidify change; you also need to change the way your teams work. By unifying the various personas that exist within your organization (data scientists, data engineers, business analysts and domain experts), you can enable them to work together on the same set of data to drive business value through key use cases. This cultural shift allows companies to become more data native.

Identify and Prioritize High-Value Use Cases

To ensure your data teams focus on the highest-value use cases, you need business and tech leadership alignment on the data platform architecture, AI goals and ethics, the correct mix of skill sets, and access to the right data.

Proper prioritization is a delicate balance between offensive and defensive use cases. The offensive ones are designed to increase revenue and customer acquisition while lowering operating costs. An example of an offensive use case is using AI to drive customer marketing segmentation to increase the pool of customers who are likely to convert. The defensive use cases, on the other hand, are there to guard and protect your organization from increased risk and ensure regulatory compliance — such as using AI to monitor clickstream events from mobile and web applications to identify fraud.

By regularly using a scorecard to measure the use cases in play, you can identify their strategic importance, feasibility, and ROI. It’s important to note that not every use case will have the same priority, even with the same business value, as some are more difficult to implement or require different data sets that may not be as readily available. To be successful with data and AI, you need to look for quick wins and build momentum from there.
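
One way to picture such a scorecard is as a weighted sum over the three criteria named above. The following is a minimal sketch, not a prescribed method: the weights, the 1-to-5 rating scale, and the example ratings are all hypothetical and would need to be agreed on by your own business and tech leadership.

```python
from dataclasses import dataclass

# Hypothetical weights for the three criteria named in the text; they sum to 1.0.
WEIGHTS = {"strategic_importance": 0.4, "feasibility": 0.3, "roi": 0.3}


@dataclass(frozen=True)
class UseCase:
    name: str
    strategic_importance: int  # assumed scale: 1 (low) to 5 (high)
    feasibility: int           # assumed scale: 1 (hard) to 5 (easy)
    roi: int                   # assumed scale: 1 (low) to 5 (high)


def score(use_case: UseCase) -> float:
    """Weighted sum of the three scorecard criteria."""
    return sum(weight * getattr(use_case, field) for field, weight in WEIGHTS.items())


def rank(use_cases: list[UseCase]) -> list[UseCase]:
    """Return use cases ordered from highest to lowest score."""
    return sorted(use_cases, key=score, reverse=True)


# Illustrative ratings only: both cases have the same strategic importance and ROI,
# but the second is rated as harder to implement.
candidates = [
    UseCase("clickstream fraud monitoring", strategic_importance=5, feasibility=2, roi=4),
    UseCase("marketing segmentation", strategic_importance=5, feasibility=4, roi=4),
]

for uc in rank(candidates):
    print(f"{uc.name}: {score(uc):.1f}")
```

With these made-up ratings, the easier use case ranks first even though both have the same business value, which is the "quick win" effect described above.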

Ease Data Governance and Compliance

The Lakehouse architecture enables you to create a single compute layer that will allow you to perform data refinement and governance in a more efficient and consistent way. By using a standard set of programming APIs, you can enable self-service data registration and automate many of the manual, error-prone steps required to create a robust data catalog and refine data. The self-servicing feature is especially critical in large enterprises where data teams are often slowed down by bureaucratic processes associated with data governance and compliance.

Data quality is also important — the goal is to centralize data quality rules so the team can execute as quickly as possible on the data as it arrives in the lake. For example, if a new data point entered in the system is a birth date, it can’t be listed as a future date because that would mean the person hasn’t been born yet. On the flip side, it also can’t be listed 140 years in the past because the person would no longer be alive. Setting up these types of enterprise-wide data quality rules on a platform that can automatically execute on them is key.
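
The birth date rule above can be sketched as a small, reusable check. This is an illustration in plain Python rather than any particular platform's rule engine; the function names are hypothetical, and treating a date exactly 140 years old as still valid is an assumption.

```python
from datetime import date
from typing import Optional

MAX_AGE_YEARS = 140  # upper bound taken from the example in the text


def years_before(day: date, years: int) -> date:
    """Return the date `years` years before `day` (Feb 29 falls back to Feb 28)."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        # `day` is Feb 29 and the target year is not a leap year.
        return day.replace(year=day.year - years, day=28)


def is_valid_birth_date(birth_date: date, today: Optional[date] = None) -> bool:
    """Reject future dates and dates more than MAX_AGE_YEARS in the past."""
    today = today or date.today()
    earliest = years_before(today, MAX_AGE_YEARS)
    return earliest <= birth_date <= today


# Usage with a fixed reference date so the result does not depend on when it runs.
reference = date(2021, 2, 18)
print(is_valid_birth_date(date(1990, 5, 1), reference))   # plausible birth date
print(is_valid_birth_date(date(2021, 2, 19), reference))  # in the future
print(is_valid_birth_date(date(1880, 1, 1), reference))   # more than 140 years ago
```

Keeping a rule like this in one shared place, rather than re-implementing it in each pipeline, is what lets it be applied consistently as data arrives.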

Democratize Access to Quality Data

To ensure data is high quality, data teams must also take its “time value” into consideration. When a data item first arrives in your organization, it has a very high value the moment it’s created. If you can act on a single piece of data right away — such as a suspicious credit card transaction — you can make real-time decisions and detect fraud, for example. While the value of an individual data item drops as it ages because it’s no longer relevant, aggregated older data becomes valuable again as it can reveal key trends or help you train a model to improve fraud detection or a similar process.

To democratize access to quality data, you’ll need to minimize the number of copies of data in your ecosystem. All too often, data teams work in split environments with both data lakes and enterprise data warehouses, which leads to copies of the same data residing on different platforms. I don’t recommend this approach because you’ll increase your risk scope by creating additional copies of data, and it’s very difficult to keep all your data in sync. By leveraging the lakehouse approach, you will have all your data in one place, minimize costs, and require fewer tools to get the job done.

Make Informed Build versus Buy Decisions

Once you have the right pieces of the puzzle in place, you need to weigh the pros and cons of choosing a vendor product versus “rolling your own” solution. Any time you decide to deploy engineering resources to build your own solution, you are effectively becoming your own software vendor and taking on all the responsibilities that come with that decision. On the plus side, you’ll be able to control the overall product/platform roadmap and prioritize which features are built next. The downside is that you’ll need to make significant investments in streamlining your CI/CD pipeline so changes and fixes can be deployed quickly, and in providing detailed documentation and training to onboard users and new developers, all while keeping in mind the cost of supporting a DevOps model.

In my experience, your business partners and the other teams that rely on the platform to run and analyze their use cases will care less about where the platform came from and how it was built. They’re more focused on how soon it will be available, whether it will support their use cases, and whether it’s reliable.

Leverage the Lakehouse Architecture for Game-Changing Insights

With every company now in the business of risk management, data is the most valuable currency we have: it drives business value, reduces costs and guards against many forms of risk, some of which are unique to AI itself. In 2020, we saw the payoff of the Lakehouse architecture, which gives data teams a more efficient data pipeline environment, compute resources that scale up and down dynamically, and a data platform that adheres to the cost and data governance policies you set. Looking ahead, once an organization operationalizes the Lakehouse architecture, its data teams can more easily focus on developing game-changing insights.

Feature image by David Mark from Pixabay.
