Banking giant Capital One wants to bring the power of data integration directly to developers, offering a new open source ETL tool designed to expedite the process of assembling data-driven applications.
The software, called Hydrograph, performs extract, transform, and load (ETL) operations: it draws data from multiple sources, formatting and aggregating it into an easily consumable form. The company unveiled Hydrograph Friday at the SXSW Interactive conference, currently taking place in Austin.
Although commercial ETL software packages have long been available, they tend to be geared more toward database administrators than developers. Capital One needed greater flexibility than commercial tools offer, both for adding new functionality and for refining core components. The company wanted developers to take control of the data preparation work because doing so allows them to build out applications much more quickly, even when working with a variety of thorny, legacy back-end data sources.
Capital One built Hydrograph in house with help from Bitwise, a Chicago-based data management company. By releasing Hydrograph as open source under an Apache 2.0 license, Capital One hopes a community of like-minded users will form around the software to share additional functionality and best practices.
Hydrograph can be used in a number of different ways, including:
- Agent call log parsing and enrichment.
- System of record data migrations.
- Data format standardization.
- Data preparation for warehouse analytics.
Hydrograph in Action
A typical Hydrograph job at Capital One might involve aggregating information from multiple data sources, which could be used to build an entirely new application. A new app could be created, for instance, that could determine which customers will overdraw an account with pending bills, allowing the company to send preemptive alerts.
Such an app would require data from three different datasets, one for customers, one for accounts and a third for bills.
With Hydrograph, data sources are picked through a built-in schema editor. Hydrograph trims the input to just the required columns, then applies additional filtering and transformations to prep the data. Developers can filter and transform data with plug-in functionality for common tasks, such as formatting dates to a single standard. The resulting dataset can then be written back to a database, or handed off to Spark or Cascading, a platform for building applications on Hadoop.
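The overdraft scenario above can be sketched as a simple ETL flow in plain Python with pandas. The datasets, column names, and threshold logic here are illustrative assumptions, not Hydrograph's actual API (Hydrograph jobs are built visually and execute on Spark or Cascading), but the join, aggregate, and filter steps mirror what such a job would do:

```python
import pandas as pd

# Hypothetical stand-ins for the three source datasets.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ana", "Ben", "Cara"],
})
accounts = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "balance": [120.00, 45.00, 300.00],
})
bills = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount_due": [80.00, 60.00, 30.00, 100.00],
})

# Transform: total the pending bills per customer.
pending = bills.groupby("customer_id", as_index=False)["amount_due"].sum()

# Join the datasets, keeping only the columns the app needs.
merged = (customers
          .merge(accounts, on="customer_id")
          .merge(pending, on="customer_id"))

# Filter: customers whose pending bills exceed their current balance.
at_risk = merged[merged["amount_due"] > merged["balance"]]
print(at_risk[["customer_id", "name"]])
```

In this toy data, only the first customer's pending bills (140.00) exceed the account balance (120.00), so the output is the set of customers who should receive a preemptive alert.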
The front-end GUI was built using Eclipse’s Rich Client Platform (RCP). The GUI palette contains components for ETL functions, such as input, output, transform, straightpull, subjob, and command, that can be dragged onto the canvas to create an ETL job. Users can browse existing Hydrograph jobs with the Project Explorer.
RCP can work with many Eclipse and Java plug-ins, which should minimize the work of integrating the software into DevOps-styled pipelines.
Hydrograph provides multiple ways to execute jobs. A developer can run a job directly on a laptop, which is great for testing on small data sets. It also allows remote execution, to run larger jobs on a cluster. Hydrograph includes execution tracking that allows users to see when a component is running and when it has completed successfully. A console provides detailed job execution information.
Hydrograph is one of a number of emerging ETL tools aimed not so much at database administrators as at developers looking to build out cloud-native Software-as-a-Service applications. Other entrants in this integration-as-a-service space include Alooma, Fivetran, Stitch and Xplenty.
TNS will be learning more about Hydrograph and other Capital One projects this weekend at SXSW. Stop on by Capital One House at Antone’s if you are in Austin, or keep an eye out for stories and podcasts from the event in the week to come.
TNS analyst Lawrence Hecht contributed to this report.
This story was sponsored by Capital One.