6 Tips for Better Data Science in the Cloud
The cloud has transformed what is possible with data science. Data teams now have access to a vast pool of elastic computing power, numerous sources of internal and external data, and managed cloud services that reduce the complexity of building, training and deploying machine learning and deep learning models at scale.
But that doesn’t mean there aren’t challenges as teams adapt from an on-premises infrastructure to a cloud-based model. Data scientists, data engineers and developers are all having to learn and adapt to a new environment, and there is an ever-expanding and rapidly evolving ecosystem of tools and frameworks from which to choose. Many are learning on the job, figuring it out as they go.
The very capabilities that make the cloud so exciting also create potential pitfalls to watch out for. The ease of copying data across diverse systems can create governance challenges if not handled properly. The speed of change means that data teams can bet on the wrong tool or framework and become stranded there. Habits and biases from the on-premises world can limit understanding of what’s possible in the cloud.
After building data management technology for many years and talking frequently with organizations of all sizes across all industries, I’ve seen some common pitfalls and misunderstandings that can hold data teams back from doing great work. The cloud opens an exciting frontier to better understand customers, monetize data in new ways and make predictions about the future. I hope the following tips will help data teams capitalize on those benefits while working in a way that is secure, efficient and effective.
1. Make Governance Your Top Priority
It’s critical to enable iteration and investigation without compromising governance and security. For example, many data scientists intuitively want to copy a dataset before they start working on it. But it’s too easy to make copies, move on and forget they exist, creating a nightmare in terms of compliance, security and privacy. A modern data platform should allow you to work on snapshots, or virtual copies, without needing to duplicate entire datasets, while maintaining fine-grained controls to ensure that only the right users and applications have access to them. Create processes that minimize copies and clean up anything copied; don’t be the person who gets your company into the headlines for the wrong reasons.
2. Leave Your Preconceptions at the Door
If you’re coming from an on-premises world, you’ll often bring perceptions and biases about infrastructure that no longer apply to modern platforms in the cloud. I’ve often heard data scientists say, “I’d love to retrain my model several times a day, but it’s too slow and will delay other processes.” But that’s not an issue in a world of elastic infrastructure. Approach the cloud from first principles. Start with what you want to achieve, not what you think is possible, and move forward from there. That’s the only way to push the boundaries and take full advantage of this new environment.
3. Avoid Creating Data Silos 2.0
Closely tied to data governance is the concept of silos. In the cloud, it’s important not to replicate the fragmentation that’s common in the on-premises world. The proliferation of tools, platforms and vendors is great for innovation, but it can also lead to redundant, inconsistent data being stored in multiple locations. Another cause of fragmentation is when structured data is stored in one environment, such as a data warehouse, while semi-structured data ends up in a data lake. Besides compromising governance and security, this fragmentation can get in the way of achieving better predictions or classifications.
Work with a cloud data platform that provides a global, consolidated view of your data. That means a platform that can accommodate structured, semi-structured and unstructured data side by side and provide a single instance across multiple cloud providers and tools — not six versions of your data replicated across different platforms and environments.
4. Keep Your Options Open
One of the exciting things about this space is that frameworks and tools are evolving at an incredible pace, but it’s critical not to get locked into an approach that limits your options when technologies fall in and out of favor. To give one example: Spark ML used to be the answer to most large-scale training problems, but now TensorFlow and PyTorch are capturing the most attention. You never know what will happen next year, or next week for that matter. Choose a data platform that won’t tie you into one framework or one way of doing things, with an extensible architecture that can accommodate new tools and technologies as they come along.
5. Incorporate Third-Party Data Sources
The cloud makes it much easier to incorporate external data from partners and data-service providers into your models. This was particularly important over the past year, as businesses sought to understand how COVID-19, fluctuations in the economy and the resulting changes in consumer behavior would affect them. For example, organizations used data about local infection rates, foot traffic in stores and signals from social media to predict buying patterns and forecast inventory needs. Explore the numerous data sources available and determine which can help to accurately address the questions your business needs to answer.
6. Minimize Complexity
It’s often said that when you have a hammer, everything looks like a nail, and this applies to AI technologies like machine learning and deep learning. They are immensely powerful and have a critical role to play for certain business needs, but they’re not right for every problem. Always start with the simplest option and increase complexity as needed. Try a simple linear regression, or look at averages and medians. How accurate are the predictions? Does the ROI of increasing the accuracy justify a more complex approach? Sometimes it does, but don’t jump to that option as your first instinct.
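To make the "start simple" advice concrete, here is a minimal sketch in Python of how you might score the simplest options against each other before reaching for anything heavier. It uses only the standard library. The function names (fit_line, mae, compare_baselines) and the sample numbers are made up for illustration, and mean absolute error is just one reasonable way to answer "how accurate are the predictions?"

```python
from statistics import mean, median


def fit_line(xs, ys):
    """Ordinary least squares with a single feature; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def compare_baselines(train_x, train_y, test_x, test_y):
    """Score three simple predictors on held-out data (lower error is better)."""
    mean_pred = mean(train_y)      # always predict the training average
    median_pred = median(train_y)  # always predict the training median
    slope, intercept = fit_line(train_x, train_y)
    return {
        "mean": mae(test_y, [mean_pred] * len(test_y)),
        "median": mae(test_y, [median_pred] * len(test_y)),
        "linear": mae(test_y, [slope * x + intercept for x in test_x]),
    }


# Made-up illustrative data: week number vs. units sold.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [12, 14, 15, 19, 20, 22]
test_x = [7, 8]
test_y = [24, 27]

scores = compare_baselines(train_x, train_y, test_x, test_y)
for name, error in scores.items():
    print(f"{name}: mean absolute error = {error:.2f}")
```

If the error of the simple line is already within what the business can tolerate, that is a signal to stop there. A more complex model then has to justify itself by the value of the error it removes.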
Doing advanced data analytics has never been more accessible. Data scientists, data engineers and developers are now among the most important members of any organization. The cloud is a simpler, more powerful and more dynamic place to do data analytics, and the challenges it presents are not hard to address when you’re aware of them and make the right decisions about technology and tools. But you need to be intentional and think before you dive in.
Starting on June 8, my company will kick off our virtual Summit, where you can join other data professionals to learn more about doing advanced analytics in the cloud. I hope you’ll join us there. In the meantime, enjoy your work and build great things.