Oracle has updated its extract, transform and load (ETL) data tool to work as a cloud service.
Oracle’s Data Integrator Cloud takes on open source competitors in data transformation for real-time enterprise analytics, and though it’s part of Oracle’s cloud suite, it’s not reliant on the other parts.
Integrator Cloud is Oracle’s Data Integrator Enterprise Edition repackaged for the cloud. It can be used with Oracle’s stack, other clouds or customers’ data centers and allows users to easily swap jobs between technologies such as Hive, HDFS, HBase, and Sqoop to standardize data format and syntax for large-scale analysis projects.
Designed for heterogeneous workloads, such as combining data from CRM, marketing, billing and even social media applications, the service provides native adaptors to a range of both Oracle and other applications, including Gmail, MailChimp, Facebook, DocuSign and others.
The service is aimed at end users such as business analysts and data science teams, which sets it apart from messaging use cases that involve moving transactions or doing transactional integration across SaaS applications such as Salesforce, according to Jeff Pollock, vice president of product management at Oracle.
“Regardless of where you decide to host your data, you can use this service to push down the processing into the location of your data without having to route your data through the Oracle cloud if you don’t want to,” he said.
Leaving Data in Place
“It allows us to execute data transformations anywhere in the customer’s architecture,” Pollock said.
In traditional ETL (extract, transform, load) architectures, datasets are copied into a central engine for transformation, then copied back out to the target locations. ELT (extract, load, transform) processing does not use a central hub — it allows the transformation to take place at the data’s destination.
The service can generate the programs to transform the data and push those algorithms out to source systems, target systems and Big Data systems in your data warehouse. This is ideal for the cloud, Pollock said, because of the network latency involved in traditional ETL operations.
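The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration, not Oracle's implementation: it uses an in-memory SQLite database as a stand-in for the target system, and the table names (`raw_orders`, `orders`) and function names are hypothetical.

```python
import sqlite3


def make_db():
    """Create a stand-in target database with raw data already loaded."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER)")
    conn.execute("CREATE TABLE orders (id INTEGER, amount_dollars REAL)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?)",
        [(1, 1250), (2, 399), (3, 10000)],
    )
    return conn


def etl_style(conn):
    """ETL: copy rows out to a central engine (this process), transform, copy back."""
    rows = conn.execute("SELECT id, amount_cents FROM raw_orders").fetchall()
    transformed = [(order_id, cents / 100.0) for order_id, cents in rows]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", transformed)


def elt_style(conn):
    """ELT: push the transformation down as SQL; rows never leave the database."""
    conn.execute(
        "INSERT INTO orders SELECT id, amount_cents / 100.0 FROM raw_orders"
    )


def read_orders(conn):
    return conn.execute(
        "SELECT id, amount_dollars FROM orders ORDER BY id"
    ).fetchall()


etl_conn = make_db()
etl_style(etl_conn)
etl_result = read_orders(etl_conn)

elt_conn = make_db()
elt_style(elt_conn)
elt_result = read_orders(elt_conn)
```

Both paths produce the same target table. In the ELT version only a SQL statement crosses the network, which is why the approach suits data that lives far from the tool generating the transformation.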
It supports multiple databases, Hadoop, the Java API JDBC, XML, JSON, web services, REST, JMS, Apache Kafka and more, offering data-based, event-based, and service-based integration all in one.
The service offers a flow-based declarative user interface along with release management capabilities that allow customers to improve productivity and code management, Pollock said.
“The real secret sauce for the tool [is that] we store all the algorithms as metadata in a central location. We talk about this being a metadata-driven process. All that metadata can be put in source control, just like application development uses source control. It can integrate with source control repositories like Git or SVN (Apache Subversion) where changes are stored over time as part of the development lifecycle [for enterprise code management],” he said.
The data put in source control is processed through knowledge modules – semantic adaptors – that generate the code for transformation based on the metadata.
“These knowledge modules are what allow this framework to be decoupled from the underlying execution. The developer only has to create the logical transformation one time. It’s stored in metadata, then they can choose different knowledge modules to execute transformations,” he said.
For example, you can deploy your project the first time using a relational database engine like Oracle. The knowledge module will then generate Oracle-specific SQL from the metadata to run the transformations within the Oracle database. But if a week later you decide you’d rather use a Big Data technology like Apache Hive or Spark, you can just select a different knowledge module and all your metadata stays exactly the same, he explained. You don’t have to rebuild any mappings or change a single line of code.
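A rough sketch of the idea, under stated assumptions: the `Mapping` class and the two generator functions below are hypothetical and are not Oracle Data Integrator's API. They only show how one logical mapping, kept as metadata, could be rendered into different target dialects by swapping the code generator. Real knowledge modules handle far more, such as staging, error handling and loading strategies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Logical transformation stored as metadata (hypothetical structure)."""
    source: str
    target: str
    columns: tuple  # pairs of (target_column, source_expression)


def _select_list(mapping):
    return ", ".join(f"{expr} AS {col}" for col, expr in mapping.columns)


def oracle_style_km(mapping):
    """Hypothetical generator emitting an Oracle-flavored direct-path insert."""
    return (
        f"INSERT /*+ APPEND */ INTO {mapping.target} "
        f"SELECT {_select_list(mapping)} FROM {mapping.source}"
    )


def hive_style_km(mapping):
    """Hypothetical generator emitting a Hive-flavored insert."""
    return (
        f"INSERT INTO TABLE {mapping.target} "
        f"SELECT {_select_list(mapping)} FROM {mapping.source}"
    )


# The mapping is defined once and never edited when the engine changes.
orders_mapping = Mapping(
    source="raw_orders",
    target="orders",
    columns=(("id", "id"), ("amount_dollars", "amount_cents / 100.0")),
)

oracle_sql = oracle_style_km(orders_mapping)
hive_sql = hive_style_km(orders_mapping)
```

Switching engines here means calling a different generator on the same `orders_mapping` object; the frozen dataclass makes it explicit that the metadata itself is not modified.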
That can improve productivity: developers can target a variety of output languages and platforms, such as Spark Streaming, Hive, IBM Netezza, DB2 or SQL Server, without having to learn platform-specific calls, because the knowledge modules generate the code.
Oracle already has customers doing push-down transformations in Amazon Redshift, Pollock said, and only time will tell whether customers choose to use the service independent of its cloud platform-as-a-service, which includes GoldenGate Cloud Service for real-time data warehousing, Database Cloud, Database Exadata Cloud and Big Data Cloud.
Retooled for the cloud as part of Oracle’s PaaS, the service is hosted on the Oracle Java Cloud Service and offered with both metered hourly and monthly subscription options. Metered customers also can invoke a dormancy state, in which they can stop and restart the service as they choose. Deployment, provisioning and lifecycle management have been automated.
Touting Cloud Prowess
Though considered late to the game of public enterprise cloud, Oracle is ramping up its presence in the space. At the Goldman Sachs Technology and Internet Conference in San Francisco on Tuesday, Oracle co-CEO Mark Hurd dismissed AWS’ competitive services as “old” and described Oracle’s as “fresher.”
For one thing, Oracle offers a virtualized cloud network, which AWS and Microsoft Azure don’t yet have. And late last month it acquired API vendor Apiary to boost capabilities in design and governance of APIs for cloud-based applications and services.