If you ask for a cup of water, is it more efficient to bring the whole barrel and give you a cup from it, or to just go fetch a cup of water?
That’s an analogy that Denodo’s Chief Marketing Officer Ravi Shankar uses to explain data virtualization.
Unlike data integration technologies that seek to create one central data pool, data virtualization connects to all the sources of data (data centers, clouds, third parties, machines) to provide answers to business questions in real time. Crucially, it does not move the data, a step that is expensive and creates significant lag in traditional ETL processes. Instead, it fetches only the relevant data.
If you want to know sales performance for the past five years and part of that information is in Teradata and part in Hadoop — you could be talking millions of rows — data virtualization returns only those rows relevant to the query. It then stitches the data together to provide the answer.
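To make the idea concrete, here is a minimal sketch of that kind of federated query. It is not Denodo code: two in-memory SQLite databases stand in for Teradata and Hadoop, and the table name, column names, data and the `federated_sales` helper are all hypothetical. The point is only that the filter travels to each source, so just the matching rows come back to be stitched together.

```python
import sqlite3

def make_source(rows):
    """Create an in-memory SQLite database standing in for one remote source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (year INTEGER, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return conn

# Hypothetical data: older years live in one system, recent years in another.
warehouse = make_source([(2012, "EU", 50.0), (2014, "EU", 100.0), (2015, "US", 120.0)])
data_lake = make_source([(2016, "EU", 130.0), (2017, "US", 150.0), (2018, "EU", 170.0)])

def federated_sales(sources, first_year, last_year):
    """Send the year filter to every source, then merge the partial results."""
    query = (
        "SELECT year, SUM(amount) FROM sales "
        "WHERE year BETWEEN ? AND ? GROUP BY year"
    )
    totals = {}
    for conn in sources:
        # Each source filters and aggregates locally; only summary rows return.
        for year, total in conn.execute(query, (first_year, last_year)):
            totals[year] = totals.get(year, 0.0) + total
    return dict(sorted(totals.items()))

result = federated_sales([warehouse, data_lake], 2014, 2018)
print(result)
```

The 2012 row never leaves its source because it fails the filter there, which is the "cup, not the barrel" behavior the analogy describes.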
Data virtualization provides an abstraction layer that: hides complexities of where data comes from and the format it’s in; does not replicate the data, which makes delivery faster; and provides agility to companies moving to the cloud because they can switch out data sources without affecting users.
Fastest Query Route
Palo Alto, Calif.-based Denodo offers a single platform that comes in different flavors, such as Agile BI, data governance, logical data lakes and logical data warehouses, and that is customized for verticals such as government, healthcare, and oil and gas.
Shankar describes three generations of data virtualization:
- First generation: simply data federation. It aggregates data across various sources and returns it to the consumer. Performance is poor. He cites solutions from IBM, Oracle, SAP and Informatica as examples.
- Second generation: includes performance optimization, but you have to program it at design time. Like a GPS that doesn’t take traffic into account, it will get you there, but not fast.
- Third generation: dynamic query optimization such as Denodo uses. It’s the GPS that figures out the traffic patterns and will re-route the query for shortest possible delivery time.
Gartner points to dynamic query optimization as Denodo’s key differentiator.
For this, Denodo uses statistical cost-based optimization designed for high data volume and complexity. At runtime, it estimates how long each query will take and optimizes the query across sources, rewriting it if necessary to fetch the data as fast as possible. Denodo is adding in-memory processing in its upcoming release, due out in the next quarter, to make the process even faster. It speeds up queries by:
- Using automatic optimizations to minimize network traffic, pushing down as much processing as possible to the data sources.
- Using parallel in-memory computation at the data virtualization layer for post-processing that cannot be pushed down to the data sources.
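The idea behind cost-based pushdown can be sketched in a few lines. This is a toy illustration, not Denodo's optimizer: the `SourceStats` fields, the three plan names and the cost model (rows sent over the network) are all assumptions chosen to show how statistics can steer the choice of plan.

```python
from dataclasses import dataclass

@dataclass
class SourceStats:
    """Hypothetical statistics an optimizer might keep about a source table."""
    name: str
    row_count: int        # total rows in the table
    distinct_groups: int  # distinct values of the GROUP BY column
    selectivity: float    # fraction of rows that pass the WHERE filter

PLANS = ("pull_everything", "push_filter", "push_filter_and_aggregate")

def estimate_rows_transferred(stats, plan):
    """Estimate how many rows cross the network under a given plan."""
    filtered = round(stats.row_count * stats.selectivity)
    if plan == "pull_everything":
        return stats.row_count
    if plan == "push_filter":
        return filtered
    if plan == "push_filter_and_aggregate":
        # After aggregation, at most one row per group comes back.
        return min(filtered, stats.distinct_groups)
    raise ValueError(f"unknown plan: {plan}")

def choose_plan(stats, supported_plans=PLANS):
    """Pick the supported plan with the lowest estimated transfer cost."""
    costs = {p: estimate_rows_transferred(stats, p) for p in supported_plans}
    best = min(costs, key=costs.get)
    return best, costs

sales = SourceStats("sales", row_count=5_000_000, distinct_groups=5, selectivity=0.25)
best_plan, costs = choose_plan(sales)
print(best_plan, costs)

# A source that cannot aggregate (say, a flat file) limits the options;
# the remaining work would then happen in the virtualization layer.
file_plan, file_costs = choose_plan(sales, supported_plans=("pull_everything", "push_filter"))
print(file_plan, file_costs)
```

A production optimizer would weigh many more factors, such as source latency and join order, but the principle of comparing estimated costs at runtime is the same.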
‘Critical’ to Big Data
Forrester calls data virtualization critical to solving the big data challenges that enterprises face. In its 2017 survey, 56 percent of global technology decision-makers said they have implemented data virtualization, are in the process or are expanding their implementations, up from 45 percent who said so in 2016.
Vendors continue to add capabilities, including analytics for social media, the Internet of Things (IoT), fraud detection and integrated insights, according to Forrester’s report. It calls the metadata catalog, which tracks all the data’s location, availability and state, integral to these services.
In a recent column for The New Stack, Denodo’s Lakshmi Randall explained how metadata and machine learning are key to automating integration.
Denodo is often seen as a startup, but it’s actually 17 years old. It was founded in 1999 by Angel Viña, a professor at the University of A Coruña, Spain, and moved its headquarters to Silicon Valley in 2005.
Because the platform is built on SQL, users will find the Denodo interface similar to that of relational databases.
Data sources make up the bottom layer, including Redshift, HP Vertica, Impala, Apache Spark, Teradata, Hive and others. It supports streaming technologies such as Kafka, Storm and Spark Streaming, and integrates with Docker technology. Data virtualization resides in the middle, with business users at the top who can make use of integration with reporting tools such as Tableau and Cognos as well as custom applications.
“Base views” in the middle represent a normalized schema that is available to upper layers. Base views contain all the logic to retrieve the data from the source, using the associated SQL dialect or, in the demo, an HTTP call. Base views are representations of the metadata and look just like a relational table regardless of the underlying data source technology.
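A rough sketch of the base-view idea follows. It is an illustration under stated assumptions rather than Denodo's implementation: the `BaseView`, `SqlBaseView` and `JsonBaseView` classes are hypothetical, SQLite stands in for a relational source, and a local function returning JSON stands in for the HTTP call so the example runs offline.

```python
import json
import sqlite3
from abc import ABC, abstractmethod

class BaseView(ABC):
    """Presents any source as named columns plus rows of tuples."""
    columns: tuple

    @abstractmethod
    def rows(self):
        """Yield one tuple per record, in the order given by `columns`."""

class SqlBaseView(BaseView):
    """Base view over a relational table, retrieved with SQL."""
    def __init__(self, conn, table, columns):
        self.conn, self.table, self.columns = conn, table, tuple(columns)

    def rows(self):
        # Table and column names come from trusted view metadata in this sketch.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        yield from self.conn.execute(sql)

class JsonBaseView(BaseView):
    """Base view over a JSON payload, such as the body of an HTTP response."""
    def __init__(self, fetch, columns):
        self.fetch, self.columns = fetch, tuple(columns)

    def rows(self):
        for record in json.loads(self.fetch()):
            yield tuple(record[c] for c in self.columns)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

def fake_http_get():
    """Stand-in for an HTTP call; returns a JSON array of order records."""
    return '[{"customer_id": 1, "total": 250.0}, {"customer_id": 2, "total": 90.5}]'

customers = SqlBaseView(conn, "customers", ["id", "name"])
orders = JsonBaseView(fake_http_get, ["customer_id", "total"])

def as_table(view):
    """Upper layers see the same tabular shape whatever the source is."""
    return [dict(zip(view.columns, row)) for row in view.rows()]

print(as_table(customers))
print(as_table(orders))
```

Because both views expose the same interface, code in the upper layers does not need to know which one is backed by a database and which by a web call.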
The Denodo Platform uses specialized connectors to data sources to retrieve their schemas at design time and the source data at runtime for subsequent processing. Each connector uses syntax and performance optimizations created for each data source to maximize performance. These connectors are configurable through a visual editor.
Denodo has connectors to a wide range of relational and NoSQL databases, packaged applications such as SAP and Salesforce, Web Services (SOAP and REST), Excel spreadsheets, XML, JSON, log files, web applications and more. It also provides an SDK for creating custom native connectors.
The platform uses extended relational algebra, enabling users to perform complex data transformation, metadata modeling, and data quality and semantic matching operations using the SQL and relational tools they already know.
Regardless of the access method, views share metadata and access permissions.
You can publish any view as a web service via SOAP, REST or OData. JMS listeners for message queues such as ActiveMQ, Sonic and IBM WebSphere MQ mean Denodo can be used as a service bus.
It provides business users with self-service search and data exploration capabilities. A view can also be exposed through all the different interfaces simultaneously. For instance, one user can access it through JDBC while others use a REST web service.
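One way to picture a single view served through several interfaces with shared permissions is the sketch below. Everything in it is hypothetical (the `PublishedView` class, the role names, and the two interface functions standing in for a JDBC-style client and a REST endpoint); it only shows that both paths go through the same permission check and return the same data in different shapes.

```python
import json

class PublishedView:
    """A view with one permission set, shared by every access path."""
    def __init__(self, name, columns, rows, allowed_roles):
        self.name = name
        self.columns = tuple(columns)
        self._rows = list(rows)
        self.allowed_roles = set(allowed_roles)

    def read(self, role):
        """Single entry point: every interface must pass this check."""
        if role not in self.allowed_roles:
            raise PermissionError(f"role {role!r} may not read {self.name}")
        return list(self._rows)

def query_interface(view, role):
    """Stand-in for a JDBC-style client: returns rows as tuples."""
    return view.read(role)

def rest_interface(view, role):
    """Stand-in for a REST endpoint: returns (status code, JSON body)."""
    try:
        rows = view.read(role)
    except PermissionError as exc:
        return 403, json.dumps({"error": str(exc)})
    return 200, json.dumps([dict(zip(view.columns, r)) for r in rows])

view = PublishedView(
    "sales_by_year",
    columns=["year", "total"],
    rows=[(2016, 130.0), (2017, 150.0)],
    allowed_roles=["analyst"],
)

print(query_interface(view, "analyst"))
status, body = rest_interface(view, "analyst")
print(status, body)
denied_status, denied_body = rest_interface(view, "guest")
print(denied_status, denied_body)
```

Keeping the permission check inside the view, rather than in each interface, is the design choice that lets new access methods be added without re-implementing security.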
“Denodo’s key strength lies in the unified and centralized data services fabric that delivers end-to-end security, lineage, transformation and orchestration across multiple traditional and big data sources. Customers like its easy-to-use, simple yet sophisticated data modeling capabilities; search capabilities; and support for various big data sources,” Forrester said in its analysis.
In addition to in-memory processing, the company is embracing the cloud. The platform is now available on AWS and Azure, with Google and others to be added. The company maintains that latency in the cloud is on par with on-premises implementations.
And it’s expanding beyond data integration into data management.
“We are becoming the data fabric. Not just a place to aggregate data, but a place where you can search data, discover data, understand relationships, understand lineage, understand all the data governance aspects,” Shankar said. A data catalog also will be part of the new release.
Customers are maturing, so Denodo is stepping up its ability to support large enterprises that use it as a data layer at scale.
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Docker.