A survey of the data science tool stacks used in the workplace reveals that businesses and enterprises value open source tools the most, and are willing to pay a premium for those able to manage data in the cloud. The bad news is that women working in data science professions are being financially penalized for no other reason than their gender.
The 2014 Data Science Survey by O’Reilly Media was released last week at Strata and Hadoop World in Barcelona, Spain. It was instantly a crowd favorite, revealing the higher wages earned by those working with a Hadoop-flavored stack. After surveying 816 data science professionals across 53 countries (although two-thirds of respondents were in the U.S.), authors John King and Roger Magoulas found that open source data tools, including those recently highlighted by The New Stack, are crucial to modern stack environments, while vertically integrated, mature tool stacks from proprietary vendors were considered a detriment.
In fact, a data scientist earned an additional $1,645 for each open source tool used in the Hadoop ecosystem, while data workers using Excel, SPSS, SQL, Oracle or others could expect to earn $1,112 less for each tool.
What the survey clearly shows is that higher earning capacity, and therefore perceived value to business, is linked with data scientists being willing to experiment with a range of tools that could each do the same or a similar job, with tools in the Hadoop ecosystem valued the highest. Hadoop users were comfortable using as many as 18 to 19 data science tools, several of which could be used interchangeably. King and Magoulas note that this indicates companies are not placing all of their bets on one data science stack, but are instead expecting their professional teams to mix and match across various tools until a business decision can be made on which ones to adopt permanently.
The survey groups data science tools into clusters. Tools within a cluster were related, such that data scientists who used one were more likely to use others in the same cluster. This was particularly true of the cluster of Hadoop ecosystem tools, but was also visible in the cluster related to coding analysis tools. For example, R coders were more likely to also use Python tools, social graph tools and Weka (all part of Cluster 3 as categorized by the surveyors). Coding analysis tools were also the most highly valued by industry: salaries increased $1,900 for each coding analysis tool that a data scientist used in their day-to-day data stack.
For every tool used in one of the clusters of open source stacks, the scientists were less likely to use proprietary tools like Excel or SPSS.
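To make the per-tool figures concrete, here is a minimal arithmetic sketch in Python. It uses only the three dollar amounts quoted above and assumes, purely for illustration, that the effects are additive and independent of one another. The function name and parameters are hypothetical, and this is not the survey authors' actual model, which also accounts for other variables.

```python
# Per-tool salary effects as quoted in the article (US dollars per tool).
HADOOP_TOOL_BONUS = 1645         # each open source tool in the Hadoop ecosystem
PROPRIETARY_TOOL_PENALTY = -1112  # each of Excel, SPSS, SQL, Oracle or others
CODING_TOOL_BONUS = 1900          # each coding analysis tool


def tool_salary_adjustment(hadoop_tools=0, proprietary_tools=0, coding_tools=0):
    """Return the combined salary difference implied by the quoted figures.

    Assumption: the three effects simply add up. Base salary, location,
    experience and every other factor in the survey are ignored.
    """
    return (
        hadoop_tools * HADOOP_TOOL_BONUS
        + proprietary_tools * PROPRIETARY_TOOL_PENALTY
        + coding_tools * CODING_TOOL_BONUS
    )


# Example: three Hadoop ecosystem tools, two proprietary tools, one coding tool.
print(tool_salary_adjustment(hadoop_tools=3, proprietary_tools=2, coding_tools=1))
# prints 4611
```

Under these simplifying assumptions, someone using three Hadoop ecosystem tools, two proprietary tools and one coding analysis tool would sit about $4,611 above an otherwise identical peer who used none of them.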
Open Source and Cloud Worth More to Business
Businesses valued workers able to use the open source tool stacks the most: they paid staff more when any of Redis, Hadoop or Elastic MapReduce was used. Interestingly, data scientists who used relational database management systems (RDBMS) alongside open source tools earned top dollar, but those who used RDBMS tools alone earned much less.
Storm and Spark also emerged as two of the most popular data tools, and were used by those who tended to earn the most amongst all the data scientists surveyed.
Data scientists working in cloud environments were also more highly valued. This trend was observable across the cloud adoption lifecycle: those who were experimenting with data in the cloud earned a bump over those working solely in on-premises environments.
Only Tableau Software sat outside the value afforded to particular stack environments: staff were neither penalized nor favored for using Tableau, and using Tableau was as common amongst open source stack enthusiasts as amongst those who preferred the mature, proprietary tools.
Louis Dorard, data scientist and author of Bootstrapping Machine Learning, attended Strata after having run the first International Predictive APIs and Apps conference (PAPIs.io) in the days immediately prior. Amongst the 200+ attendees at PAPIs, “open source libraries such as Scikit-learn (for Machine Learning) and Pandas (for data wrangling) in Python” were most often mentioned as being used, he said. Dorard also pointed to data scientists’ love of “Command line tools (for example, those found in the Data Science Toolbox).”
While Dorard acknowledged that many data scientists he meets with tend to use a variety of open source tools — often interchangeably — the big split in tool choice often comes down to which can be used to support prototyping and which are suited to production:
“In predictive modeling, there are two very different activities: prototyping models and deploying them into production. It’s difficult to be skilled at both. Prototyping requires quick iterations and is performed on the laptop, with R or Python. Usually people have their own preference depending on their background and experience. Deploying to production requires translating prototypes to another platform, but this is changing with tools that were seen at PAPIs such as Yhat and Azure ML that take your existing R/Python code and deploy it to production in a click.”
Women Shortchanged… Again
In one of the most troubling findings, the relatively new field of data science is already mimicking traditional social stratification patterns: women with the same skill sets as their male counterparts are being paid US$13,000 less, the same gender pay inequity observed across the majority of industries in the U.S. The authors note, “Gender serves as the least logical of the predictor variables, as no tool use or other factors explain the gap in pay – there seems no justification for the gender gap in the survey results.”
Gender inequity is such an issue amongst the IT workforce that an upcoming virtual programmers' conference is raising funds in aid of organizations like Women Who Code, Code2040, and Black Girls Code, all of which focus on addressing gender and diversity inequality in the IT workforce. Hack.Summit(), to be held next week from Monday, December 1 to Thursday, December 4, is a virtual conference featuring top industry talent sharing best practices across a range of programming languages. Organizer Ed Roman says the conference will aim to shine a light on the gender and minority inequities in IT fields such as data science.
How Do You Compare with Your Data Science Peers?
The following calculator has been created to show what you are likely to earn, based on the summary results of the 816 survey respondents to the Data Science Survey.
Feature image via Flickr Creative Commons