
Where Do Data Practitioners Prefer to Collaborate? GitHub

11 Jul 2022 11:37am, by

Two-thirds of data practitioners publicly share their data analysis or machine learning applications, according to The New Stack’s analysis of Kaggle’s latest annual survey of machine learning and data science.

Of those collaborating publicly, 76% said they do so using GitHub. Despite its critics, the platform continues to be one of the most critical parts of the tech stack for developers and non-developers building data and artificial intelligence-enabled applications.

In 2021, more than 25,000 people took the survey. Since many of the participants were using the Google-owned Kaggle platform to learn how to become data scientists, The New Stack’s analysis looked only at the 17,182 respondents who reported being employed.

Of the 840 machine learning engineers in the study, 61% said they use GitHub for sharing, the highest percentage of any profession in the report. Although just 40 people working in developer relations/advocacy took part in the study, it is noteworthy that only 45% of them said they use GitHub to share their applications or analysis.
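The filtering and per-profession percentages described above can be sketched in a few lines of Python. This is only an illustration on made-up toy rows, not The New Stack’s actual analysis code: the field names (`job_title`, `employed`, `shares_on`) and the records are hypothetical and do not match the Kaggle survey’s real column names. It also assumes the denominator is every employed respondent in a profession, which the article does not state explicitly.

```python
from collections import defaultdict

# Hypothetical toy records; the real Kaggle survey uses different column names.
responses = [
    {"job_title": "Machine Learning Engineer", "employed": True, "shares_on": {"GitHub", "Kaggle"}},
    {"job_title": "Machine Learning Engineer", "employed": True, "shares_on": set()},
    {"job_title": "Data Analyst", "employed": True, "shares_on": {"Colab"}},
    {"job_title": "Data Analyst", "employed": True, "shares_on": {"GitHub"}},
    {"job_title": "Student", "employed": False, "shares_on": {"GitHub"}},
]


def platform_share_by_job(rows, platform):
    """Return {job_title: percent of employed respondents who share on `platform`}."""
    totals = defaultdict(int)  # employed respondents per job title
    users = defaultdict(int)   # of those, how many share on the platform
    for row in rows:
        if not row["employed"]:
            continue  # keep only employed respondents, as the analysis did
        totals[row["job_title"]] += 1
        if platform in row["shares_on"]:
            users[row["job_title"]] += 1
    # Assumed denominator: all employed respondents with that job title.
    return {job: 100.0 * users[job] / totals[job] for job in totals}


github_share = platform_share_by_job(responses, "GitHub")
print(github_share)
```

Changing the denominator to only those respondents who share publicly somewhere would yield higher percentages, so the choice matters when comparing figures across professions.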

[Chart: Where do you publicly share your data analysis or machine learning applications?]

Data scientists, software developers and data analysts represented the largest portion of the study’s participants. Here are a few more takeaways from the study:

  • Collaboration tools built for data science, machine learning and artificial intelligence use cases did not see widespread adoption in the Kaggle survey. Of the study participants who said they collaborated publicly, a third used Kaggle itself and 20% used Colab, which is also a Google product. Since these offerings are affiliated with the survey itself, we don’t think they represent anything about the larger market.
  • Streamlit, which was bought by Snowflake earlier this year, was cited as a preferred collaboration tool by 4%. In May, Streamlit’s former CEO described the rise of data-driven apps to The New Stack.
  • Open source nbviewer and Plotly Dash, which has turned a popular open source visualization tool into a low-code platform, were two other ways that data analysis and ML apps are shared.

IDEs and Collaboration

Collaboration is also taking place in and between notebooks, which have taken on a life of their own as integrated development environments (IDEs). Just like most developers, the average data practitioner uses more than one IDE, but some flavor of Jupyter or JupyterLab is most common, with Visual Studio Code placing second. Yet many types of hosted notebooks are struggling to catch on in a crowded field:

  • More than a third of the study’s participants reported using Kaggle and Colab Notebooks. Google appears to be having success turning these users into paying customers for its other notebook and cloud offerings.
  • Eight percent are using Binder, which turns a Git repo of Jupyter notebooks into an interactive live environment.
  • Overall, 7% of study participants said they use specific Amazon Web Services and Microsoft Azure notebook offerings. However, more than 15% of AWS and Microsoft Azure cloud computing customers are also using a notebook or other AI-type solution from their cloud provider.
  • Databricks and IBM offerings got more than passing mentions, but they remain niche products.
  • Deepnote, Code Ocean, Gradient and Observable were each used by only 1% of study participants.

We are still in the early days of data-enabled applications. Most data analysts are not interested in software licensing or which code repository they use. They want to go where the data is and where people are most likely to be sharing their models. According to Meltano, a company spun off by GitLab itself, that’s GitHub.

I could provide a huge list of low-code platforms, DataOps pipeline integrations, collaboration tools and next-generation Airtables, many with strong followings. But few, if any, of them are truly close to mass adoption. Some have reached viability as niche products in niche industries, but only variations of Jupyter notebooks and GitHub seem to be familiar enough to non-technical audiences, data pros and developers to become breakthrough hits.

What do you think? How can the modern data stack break out of the pattern without stifling collaboration? You can reach out here.

The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Dash.