It has been widely reported that the modern enterprise is collecting huge amounts of data. Clearly, these enterprises will need someone to do something with all of it.
With so many headlines heralding the data scientist as the most coveted job of our time, software developers could be forgiven for thinking that the burden of wrangling big data belongs solely to the rare and mysterious individuals holding that job title. In fact, over the next decade, managing and manipulating data will become a central aspect of every software developer’s job.
The reason for this is simple: Data is the new stack, or at least, its very foundation.
Here are just a few of the ways in which the data-centric world of the new stack will differ from that of previous eras of technology, and just a few of the things developers will need to do to keep up.
Data defines applications in the new world
In the old world, it was common to think of applications as “owning data.” One of the key challenges faced by application developers and architects was determining the best ways to format and store the application’s data. Sharing data between applications was typically an afterthought and required either an expensive “enterprise application integration” effort or, more recently, the development of a one-off API.
In the new world, we will look back on this idea as quaint. Data will be stored in an enterprise-wide data repository (a data reservoir, or “data lake,” if you prefer), in a system like HDFS, and all enterprise applications will access this repository as needed.
For developers, the ability to work with a wide range of data types and formats will be key, and comfort with the manipulation of unstructured data, as opposed to traditional structured data types, will be especially important.
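As a rough illustration of what working against a shared repository of mixed formats can look like, here is a minimal sketch in Python. It uses a local temporary directory as a stand-in for an HDFS-backed data lake, and the file names, fields, and `load_records` helper are hypothetical examples, not part of any real system:

```python
import csv
import json
import tempfile
from pathlib import Path

# Stand-in for a shared enterprise repository: a local temp
# directory here, where a real deployment might use HDFS.
lake = Path(tempfile.mkdtemp())

# One application writes structured CSV...
with open(lake / "orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "amount"])
    writer.writerow(["1001", "25.00"])

# ...another writes semi-structured JSON events.
(lake / "events.json").write_text(
    json.dumps({"order_id": "1001", "event": "shipped"})
)

def load_records(path):
    """Normalize files of different formats into plain dicts."""
    records = []
    for p in sorted(path.iterdir()):
        if p.suffix == ".csv":
            with open(p, newline="") as f:
                records.extend(dict(row) for row in csv.DictReader(f))
        elif p.suffix == ".json":
            records.append(json.loads(p.read_text()))
    return records

records = load_records(lake)
```

The point is not the code itself but the posture it reflects: no single application "owns" these files, and any consumer must be prepared to normalize whatever formats it finds.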
Applications move to data, not the other way around
In the old world, the key strategy for minimizing latency was to move the data closer and closer to the application via caching, replication, and other means. This strategy worked well when data volumes were small. New stack thinking turns this idea on its head, necessitated in large part by the high cost of moving around large volumes of data. In the new world, application processing is moved as close to the data as possible, an idea referred to as “data locality.”
One of the first systems to truly exploit data locality on a large scale was Hadoop, and developers needed to wrap their minds around a new computing paradigm, MapReduce, to do so. More recently, the advent of Hadoop YARN allows applications and tools built around alternative programming models to locally access Hadoop-based data. Mastering a framework that provides for data locality, whether MapReduce or otherwise, will be important for developers building new world applications.
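To make the paradigm concrete, the classic introductory MapReduce example is a word count: a map phase that emits key–value pairs and a reduce phase that aggregates them by key. The sketch below simulates both phases serially in plain Python; in Hadoop, the map tasks would instead be scheduled on the nodes that already hold each input split, which is the data-locality payoff:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one document.
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key, then sum each group.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = [
    "the data is the new stack",
    "the new stack runs on data",
]
# Simulated serially here; a real cluster runs map tasks in
# parallel, each on the node holding its slice of the input.
pairs = chain.from_iterable(map_phase(d) for d in documents)
word_counts = reduce_phase(pairs)  # e.g. word_counts["the"] == 3
```

The mental shift is that the developer writes only the two pure functions; the framework decides where they run, and it tries to run them where the data already lives.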
Math isn’t just for game devs and quants anymore
In many cases, taking full advantage of all this data will require some mathematics and statistics skills that the old world hasn’t required of most developers, at least of those not working in gaming, research, or finance.
Machine learning platforms and statistical libraries will shoulder much of the heaviest lifting, but just as it’s important for developers to understand SQL to make best use of an ORM library, it will be incumbent upon them to grok a bit of the underlying statistics in order to effectively deliver new world applications and systems that rely on these tools.
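As one small example of the kind of statistics worth grokking, consider ordinary least squares, the workhorse behind many library "fit a line" calls. A statistical library will do this for you, but knowing the closed-form math, the slope is the covariance of x and y over the variance of x, is what lets you reason about what the library returns. A minimal stdlib-only sketch, with a hypothetical toy dataset:

```python
from statistics import mean

def least_squares(xs, ys):
    """Ordinary least squares for a line y = a + b*x,
    computed from the closed-form normal equations."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: covariance of x and y divided by variance of x.
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    # Intercept: the fitted line passes through (x_bar, y_bar).
    a = y_bar - b * x_bar
    return a, b

# A perfectly linear toy dataset: y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = least_squares(xs, ys)  # a == 1.0, b == 2.0
```

A developer who has internalized even this much is far better placed to notice when a model's output is nonsense, just as knowing SQL helps you notice when an ORM emits a pathological query.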
While embracing big data can mean big change for developers, it also offers many big opportunities. And there has never been a better time to sharpen the saw as a developer, with free software, free coursework, and nearly free computing environments around every corner.
Sam Charrington is the principal of CloudPulse Strategies, an analyst and consulting firm focusing on cloud computing, big data and related technologies and markets. He can be followed on Twitter at @samcharrington.
Image by JD Hancock