Future-Proof Your Big Data Processing with Apache Beam
I was at Strata SJC 2016, eating lunch and chatting with two engineers. They were excited to go back to work and rewrite their system to use the new frameworks they had just learned about. The engineer in me thought “awesome!” The business person in me thought “I hope they run it by their manager first. That’s a massive time sink and probably a waste of effort.”
I seriously doubt they ran it by their manager first. They probably spent a few days writing and debugging code that won’t help the business’s bottom line.
You might sit back and think “Our engineers are more disciplined than that. We’d never have to rewrite code for a new platform.” If your organization went from MapReduce to Spark, then, yes, your team rewrote code for a new platform. And they shouldn’t have had to.
Stopping the Rewrites
Your team’s code was written to directly use the MapReduce API. Their rewrite to use Spark was, in turn, written directly to Spark’s API. The next time there is a new platform, there will be another direct API rewrite and so on.
How do we stop rewriting code for every new platform? We start using an intermediary API and stop writing directly to each engine’s API.
This is where Apache Beam comes in. You write only to Beam’s API and then choose a technology to run the code on. Writing the code and executing the code are decoupled.
Beam would change the lunch conversation above to “I learned about this new technology. I’m going to make a configuration change and see if our jobs run faster.” You’re no longer doing rewrites to change from one technology to another.
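To make the “configuration change” idea concrete, here is a minimal sketch in plain Python. It is not Beam’s actual API. The names build_pipeline, run_local, run_threaded, and RUNNERS are all made up for illustration, and the two “engines” are just a loop and a thread pool. Beam’s own SDKs work along similar lines, with the engine picked through a runner option on the pipeline rather than through code changes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

# In this toy, a "pipeline" is an ordered list of per-element functions.
Step = Callable[[str], str]


def build_pipeline() -> List[Step]:
    """Business logic, written once against the intermediary API."""
    return [str.strip, str.lower]


def apply_all(steps: List[Step], item: str) -> str:
    """Run one element through every step in order."""
    for step in steps:
        item = step(item)
    return item


def run_local(steps: List[Step], data: Iterable[str]) -> List[str]:
    """Hypothetical engine 1: a plain sequential loop."""
    return [apply_all(steps, item) for item in data]


def run_threaded(steps: List[Step], data: Iterable[str]) -> List[str]:
    """Hypothetical engine 2: the same steps spread over a thread pool."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map returns results in input order.
        return list(pool.map(lambda item: apply_all(steps, item), data))


# The only place that knows which engines exist.
RUNNERS: Dict[str, Callable[[List[Step], Iterable[str]], List[str]]] = {
    "local": run_local,
    "threaded": run_threaded,
}


def run(config: Dict[str, str], data: Iterable[str]) -> List[str]:
    """Pick the engine from configuration; the pipeline code is untouched."""
    return RUNNERS[config["runner"]](build_pipeline(), data)


words = ["  Spark", "FLINK ", " beam "]
local_result = run({"runner": "local"}, words)
threaded_result = run({"runner": "threaded"}, words)
```

Switching engines here means changing one string in the config. build_pipeline never changes, and supporting a new engine means adding one entry to RUNNERS instead of rewriting the jobs.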
Placing Difficult Bets
Technical leaders face the difficult decision of placing a long-term bet on rapidly changing technologies. We’re already seeing a transition from Hadoop MapReduce to Spark. It wasn’t that those who chose Hadoop MapReduce made a strategic mistake. A better batch processing engine came along later and they were forced to rewrite.
Now leaders face another difficult question. Which batch or stream processing engine should you use? Should you use Spark, Flink, Storm, or another up-and-coming technology? This is where Beam becomes even more interesting. It uses the same API for both batch and stream (real-time) processing.
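A rough way to picture “same API for batch and stream” is a function written against a generic sequence of records, so it doesn’t care whether the input ever ends. The sketch below is plain Python, not Beam code, and count_words and live_feed are names I invented for it. A real engine also has to deal with ordering, late data, and failures, none of which this toy covers.

```python
import itertools
from typing import Dict, Iterable, Iterator, Tuple


def count_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a running (word, count) pair each time a word arrives."""
    counts: Dict[str, int] = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
            yield word, counts[word]


# Batch: a bounded input that is fully known up front.
batch = ["beam spark", "beam flink"]
batch_result = list(count_words(batch))


def live_feed() -> Iterator[str]:
    """Hypothetical unbounded source that never stops producing lines."""
    for i in itertools.count():
        yield "beam" if i % 2 == 0 else "flink"


# Stream: the same function, fed an endless input.
# We only look at the first four results so the example terminates.
stream_result = list(itertools.islice(count_words(live_feed()), 4))
```

The processing logic is identical in both cases. Only the source changes, which is the property that lets one codebase serve batch and real-time jobs.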
Which one of these technologies is the next big thing? Which one will be the one that gains a community, general acceptance, and lives on? I couldn’t tell you, but I don’t have to tell you now. As these new technologies come out, you make configuration changes and move to the new framework.
One API to Rule Them All
Simplicity saves businesses time and money. Think about how many APIs your engineers need to use to write data processing jobs. They’ll need to know the Hadoop MapReduce API, the Spark API, the Flink API, and whatever comes next. With Beam, your engineers only need to know one API. This same API can handle small data processing of MBs all the way up to large data processing of TBs and PBs.
Engineers will only need to know how each framework, like Hadoop MapReduce or Spark, works in a conceptual sense. They’ll need to understand the general tradeoffs among the systems. However, the coding for each one will be the same. That will result in a big productivity increase.
I can’t tell what’s going to happen in the future for Big Data. This is especially true for streaming frameworks. We used to have to predict what’s coming next to prevent the next rewrite. Beam allows us to future-proof without worrying about the next big thing.