Apache Hop, the open source metadata-based data engineering and data orchestration platform, was recently named an Apache Software Foundation top-level project.
Everything in Hop is treated as metadata. This allows it to work flexibly with hundreds of data platforms and their configuration.
The metadata describes how data should be processed or how workflows and pipelines should be orchestrated.
The project originated more than two decades ago as the Extract-Transform-Load (ETL) platform Kettle, which was acquired by Pentaho (now Hitachi Vantara) and brought to market as Pentaho Data Integration (PDI).
The software was refactored over several years, and a fork named Hop, short for Hop Orchestration Platform, entered the Apache Incubator in September 2020.
Data workflows and pipelines are set up visually in a drag-and-drop graphical user interface (GUI) and described with metadata.
Users don’t need specific programming knowledge to design, test and run workflows and pipelines. Alternatively, programmers and developers can work from the command line.
Hop runs in a Java environment, so it can be used independently of the operating system. It has been designed to work anywhere: on-premises, in the cloud, on a bare OS, in containers, in IoT environments and more, on Windows, Linux and macOS.
The Hop engine uses a kernel architecture containing only core functionality. All other functionality is added through plugins. More than 250 plugins are available with the standard installation, though you can easily add your own or third-party plugins.
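The kernel-plus-plugins idea can be illustrated with a minimal Java sketch. It is not Hop's actual plugin API: the class `KernelSketch`, its methods and the plugin ids are all hypothetical, chosen only to show a core that holds a registry while the functionality lives in plugins registered against it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical illustration only: these names are NOT Hop's real API.
public class KernelSketch {
    // The "kernel" knows no concrete transforms; it only keeps a registry.
    private final Map<String, UnaryOperator<String>> plugins = new LinkedHashMap<>();

    // A plugin registers itself under an id.
    public void register(String id, UnaryOperator<String> plugin) {
        plugins.put(id, plugin);
    }

    // Look up a plugin by id and apply it to a single value.
    public String apply(String id, String input) {
        UnaryOperator<String> plugin = plugins.get(id);
        if (plugin == null) {
            throw new IllegalArgumentException("No plugin registered for id: " + id);
        }
        return plugin.apply(input);
    }

    public static void main(String[] args) {
        KernelSketch kernel = new KernelSketch();
        // Built-in and third-party plugins are added the same way.
        kernel.register("uppercase", String::toUpperCase);
        kernel.register("trim", String::trim);
        System.out.println(kernel.apply("trim", "  hop  "));
    }
}
```

In this sketch, adding functionality never changes the kernel class itself, which is the property the architecture is after.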
Hop 1.0, released in October, included a massive architecture redesign and code refactoring toward its current kernel-plus-plugins architecture.
“This architecture significantly improves the development process and allows Hop to adapt to the architecture it needs to be deployed in, not the other way around,” members of the Hop Project Management Committee said in an email.
The integration with Apache Beam allows Hop pipelines and workflows to run not only on Hop’s native engine, locally and remotely, but also on Apache Spark, Apache Flink and Google Cloud Dataflow. Project teams can therefore take their projects to where the data is without modifying their work.
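A rough way to picture "same pipeline, different engine" is to keep the pipeline as plain metadata and let interchangeable engines interpret it. The Java sketch below is an assumption-laden illustration, not Hop or Beam code: `PipelineMeta`, `Engine`, `SEQUENTIAL` and `PARALLEL` are hypothetical names, and a parallel stream merely stands in for a distributed runner.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: none of these types exist in Hop or Beam.
public class EngineSketch {
    // The pipeline is pure metadata: an ordered list of transform names.
    static final class PipelineMeta {
        final List<String> transforms;
        PipelineMeta(List<String> transforms) { this.transforms = transforms; }
    }

    // An engine decides how the metadata is executed.
    interface Engine {
        List<String> run(PipelineMeta meta, List<String> rows);
    }

    // Apply every transform named in the metadata to one row, in order.
    static String applyAll(PipelineMeta meta, String row) {
        String out = row;
        for (String name : meta.transforms) {
            switch (name) {
                case "trim": out = out.trim(); break;
                case "uppercase": out = out.toUpperCase(); break;
                default: throw new IllegalArgumentException("Unknown transform: " + name);
            }
        }
        return out;
    }

    // Processes rows one by one in the calling thread.
    static final Engine SEQUENTIAL = (meta, rows) ->
        rows.stream().map(r -> applyAll(meta, r)).collect(Collectors.toList());

    // Stand-in for a distributed runner: rows are processed in parallel.
    static final Engine PARALLEL = (meta, rows) ->
        rows.parallelStream().map(r -> applyAll(meta, r)).collect(Collectors.toList());

    public static void main(String[] args) {
        PipelineMeta meta = new PipelineMeta(Arrays.asList("trim", "uppercase"));
        List<String> rows = Arrays.asList(" hop ", " beam ");
        // The metadata is untouched; only the engine changes.
        System.out.println(SEQUENTIAL.run(meta, rows));
        System.out.println(PARALLEL.run(meta, rows));
    }
}
```

Both calls produce the same rows because the pipeline definition never changes; only the execution strategy does.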
Hop supports project life cycle management through best practices, integrated version control, unit testing, support for projects and deployment environments and more.
It includes a library of integration tests and templates for metadata injection pipelines. The injection is done at runtime, reducing the need for manual development.
Hop supports multiple projects and environments. Project environments contain the configuration for a project deployment on development, test, production or other stages of your project’s life cycle.
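One way to think about project environments is as named sets of variables that are substituted into a shared project definition at run time. The Java sketch below illustrates that idea only; the `${NAME}` placeholder syntax, the class `EnvironmentSketch`, the variable `DB_HOST` and the host names are assumptions made for the example and are not taken from Hop's configuration format.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of per-environment configuration; not Hop's real format.
public class EnvironmentSketch {
    // Matches placeholders such as ${DB_HOST} (assumed syntax for this example).
    private static final Pattern VAR = Pattern.compile("\\$\\{([A-Za-z0-9_]+)\\}");

    // Replace every placeholder with the value defined by the chosen environment.
    static String resolve(String template, Map<String, String> environment) {
        Matcher m = VAR.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = environment.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Undefined variable: " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in values from being misread.
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The project stores one connection template...
        String template = "jdbc:postgresql://${DB_HOST}/sales";

        // ...and each environment supplies its own values (made-up hosts).
        Map<String, String> dev = new HashMap<>();
        dev.put("DB_HOST", "dev-db.example.internal");
        Map<String, String> prod = new HashMap<>();
        prod.put("DB_HOST", "prod-db.example.internal");

        System.out.println(resolve(template, dev));
        System.out.println(resolve(template, prod));
    }
}
```

Switching from development to production then means selecting a different variable set, not editing the workflow or pipeline.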
Project files are version controlled through the git integration in Hop GUI’s file explorer, giving users options such as the ability to visually compare two versions of a workflow or pipeline.
Using the Hop GUI, developers and engineers can manage the entire project life cycle: switch between projects, environments, runtime configurations, manage git versions, etc.
“We started adopting Apache Hop in our data integration projects in early 2021 because of its flexibility, scalability and ease of use, in various scenarios ranging from classical DWH ETL processes to highly critical, real-time processes,” said Sergio Ramazzina, CEO and chief architect at Italian business analytics firm Serasoft S.r.l., and member of the Apache Hop Project Management Committee.
“We are impressed by how responsive the community is in solving issues and helping users approaching the platform — an important point to increase users’ adoption and trust.”
The Hop community continued working on the Hop 1.1.0 release throughout the graduation process, according to the committee.
Hop 1.1.0 will contain work on over 200 tickets. These include numerous UI improvements and bug fixes, Apache Tika support and asynchronous web services.
In addition to extending the integration with Apache Beam, the community continues to work on new and deeper integrations with other Apache projects such as Airflow and PLC4X.
“In the longer term, we’ll build a marketplace where third-party plugins can easily be shared, a new GUI for improved monitoring, logging, debugging and previewing of pipelines and lots of other exciting new features the community is already working on,” the committee said.
“We consider the graduation as a top-level project as the start of an exciting new era for Hop.”
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: PDI.