The biggest obstacle when deploying a model to production is typically not creating the model at all, but creating a data pipeline that can deliver fresh data to that model. Rasgo was founded in 2020 to solve this problem by making it easier for data scientists to prepare data in a production-ready manner, leveraging the cloud data warehouse.
After working with hundreds of data scientists across a multitude of industries, my co-founder and I consistently saw data scientists frustrated by the cumbersome effort associated with data prep for ML. The vast majority of their time was spent extracting, exploring, cleansing, joining and transforming data — rather than developing models and solving tough problems.
Before Rasgo, I started my career as a solo data scientist: I worked on a team with great colleagues in marketing and ops, but without other data engineers, data analysts or data scientists to partner with. In this role, I quickly found myself devoting more time to keeping data pipelines up and running than to new statistical modeling and analysis. One of my biggest challenges was running long Python scripts, built in pandas, to rescore models and update the dashboards that my marketing and ops teammates used to make daily decisions.
All my extract and transformation scripts were built in Python, but the only production-ready environment I had access to supported SQL. Sound familiar? The good news is that three things have changed since then:
- Centralization of data has made a full-scale comeback, propelled by Snowflake and BigQuery (RIP, Hadoop).
- Fivetran and Airbyte have largely solved the ‘E’ part of ELT (Extract, Load, Transform), unless you’re a Global 2000 enterprise with a large number of on-prem data sources.
- Instead of extracting data from the data warehouse into a separate compute environment, we now bring compute to the data and run most transformation workloads in-warehouse (shout out to dbt).
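To make the third point concrete, here is a minimal sketch of the difference between pulling rows out to transform them and pushing the transformation down to the database. It uses Python's built-in `sqlite3` module as a stand-in for a cloud data warehouse; the `orders` table, its rows, and the function names are invented for illustration.

```python
import sqlite3


def build_demo_warehouse() -> sqlite3.Connection:
    """Create an in-memory database that stands in for a warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 20.0), (1, 35.0), (2, 10.0), (2, 5.0), (3, 50.0)],
    )
    return conn


def totals_client_side(conn: sqlite3.Connection) -> dict:
    """Extract first, transform later: every row travels to Python."""
    totals: dict = {}
    for customer_id, amount in conn.execute(
        "SELECT customer_id, amount FROM orders"
    ):
        totals[customer_id] = totals.get(customer_id, 0.0) + amount
    return totals


def totals_in_warehouse(conn: sqlite3.Connection) -> dict:
    """Push-down: the database aggregates; only the results come back."""
    cursor = conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    )
    return dict(cursor.fetchall())


if __name__ == "__main__":
    warehouse = build_demo_warehouse()
    print(totals_client_side(warehouse))
    print(totals_in_warehouse(warehouse))
```

Both functions return the same totals. On five rows the difference is invisible, but on a table with billions of rows the second approach moves one row per customer over the network instead of the whole table.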
At ScaleUp:AI, Rasgo, Snowflake and DataRobot will discuss this problem in more detail, focusing on one particularly ubiquitous model: Customer Lifetime Value, or CLV. Predicting CLV is valuable, but using that prediction to drive intelligent decisions in marketing, sales, and supply chain is even more valuable.
So, what’s missing? The Modern Data Stack leaves one user out in the cold: the Python data scientist. This user lives in a Jupyter notebook and is great at scripting data transformation and feature engineering in pandas or Dask. SQL is not their language of choice; ‘select * from’ isn’t scary, but generating hundreds of features with window functions, CTEs, self-joins and the like is both overwhelming and inefficient.
We propose RasgoQL as the bridge for the Python user into the Modern Data Stack. Using RasgoQL in a Jupyter notebook (or a Hex notebook, if you’re fancy), the Python data scientist can write pandas-like transformation code against tables or views, quickly generating hundreds of lines of SQL that will run directly in your Snowflake, BigQuery or Postgres data warehouse (with more data warehouse support coming soon). The best part? In one line of code, you can export this SQL to your dbt project so that it can run in production alongside other data pipelines.
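To give a feel for the general idea, here is a toy sketch of a chainable, pandas-like object that compiles its steps into a single SQL statement built from CTEs. This is not the RasgoQL API: the `Dataset` class, its `filter` and `rolling_sum` methods, and the `orders` table and column names are all hypothetical, and arguments are interpolated into the SQL without any escaping.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# A step takes the name of the previous relation and returns a SELECT statement.
Step = Callable[[str], str]


@dataclass(frozen=True)
class Dataset:
    """Hypothetical lazy reference to a warehouse table (not the RasgoQL API)."""

    source: str
    steps: Tuple[Step, ...] = ()

    def _then(self, step: Step) -> "Dataset":
        # Return a new Dataset so a chain never mutates the one it started from.
        return Dataset(self.source, self.steps + (step,))

    def filter(self, condition: str) -> "Dataset":
        return self._then(lambda prev: f"SELECT * FROM {prev} WHERE {condition}")

    def rolling_sum(self, column: str, partition_by: str, order_by: str,
                    windows: Sequence[int]) -> "Dataset":
        # One window-function feature per requested window size (in rows).
        features = [
            f"SUM({column}) OVER (PARTITION BY {partition_by} ORDER BY {order_by} "
            f"ROWS BETWEEN {n - 1} PRECEDING AND CURRENT ROW) AS {column}_sum_{n}"
            for n in windows
        ]
        select_list = ",\n    ".join(["*"] + features)
        return self._then(lambda prev: f"SELECT\n    {select_list}\n  FROM {prev}")

    def sql(self) -> str:
        """Compile the chain into one statement, one CTE per step."""
        if not self.steps:
            return f"SELECT * FROM {self.source}"
        ctes = []
        prev = self.source
        for i, step in enumerate(self.steps, start=1):
            name = f"step_{i}"
            ctes.append(f"{name} AS (\n  {step(prev)}\n)")
            prev = name
        return "WITH " + ",\n".join(ctes) + f"\nSELECT * FROM {prev}"


if __name__ == "__main__":
    features = (
        Dataset("orders")
        .filter("status = 'complete'")
        .rolling_sum("amount", partition_by="customer_id",
                     order_by="order_date", windows=[7, 30, 90])
    )
    print(features.sql())
```

Nothing runs against the warehouse until the generated SQL is executed, and each extra window size adds one more feature column without any hand-written SQL. Because the output is a plain `WITH ... SELECT` statement, it is the kind of text that could be saved as a model file in a dbt project.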
Now, not all Python transformations can be compiled as pure SQL code… yet. As data warehouses improve their support for Python UDFs, RasgoQL will enable easy bundling of transform code that stays in Python alongside your other transforms, so that the whole transform chain can be orchestrated together.
We can’t wait to get your feedback on RasgoQL. At Rasgo, we believe in a future where all members of your organization can generate insights from data in less than five minutes, and RasgoQL is a powerful start.
The New Stack’s parent company Insight Partners is hosting the ScaleUp:AI conference on April 6-7, alongside partner Citi and the AI industry’s most transformational leaders. Bringing the visionaries, luminaries, and doers of AI and innovation together, this hybrid conference will unlock ideas, solve real business challenges, and illustrate why we are in the middle of the AI ScaleUp revolution and how to turn it into commercial reality. RSVP today to access early bird pricing and receive an additional TNS discount on top: Use the code TNS25.
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Rasgo.
Snowflake is a sponsor of The New Stack.
Feature image via Pixabay.