TensorFlow is a hugely popular open source machine learning library. It excels at simplifying the development and training of deep neural networks using a computational model based on data flow graphs. However, TensorFlow, like data science in general, doesn’t come without challenges.
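To make the idea of a data flow graph concrete, here is a minimal sketch in plain Python. It is not TensorFlow's API: the `Node`, `constant`, and `evaluate` names are hypothetical and exist only for illustration. The point it shows is that the graph is described first, and values are computed only when a node is evaluated and data flows along its edges.

```python
# Hypothetical, minimal data flow graph. This is NOT TensorFlow's API;
# it only illustrates the "build a graph, then run it" idea.
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Node:
    """One operation in the graph; its inputs are other nodes."""
    name: str
    op: Callable[..., float]
    inputs: Tuple["Node", ...] = ()


def constant(name: str, value: float) -> Node:
    # A source node with no inputs that always yields the same value.
    return Node(name, lambda: value)


def evaluate(node: Node, cache: Optional[Dict[str, float]] = None) -> float:
    """Evaluate a node by first evaluating every node it depends on."""
    if cache is None:
        cache = {}
    if node.name not in cache:
        args = [evaluate(dep, cache) for dep in node.inputs]
        cache[node.name] = node.op(*args)
    return cache[node.name]


# Describe the graph for y = w * x + b; nothing is computed yet.
x = constant("x", 3.0)
w = constant("w", 2.0)
b = constant("b", 1.0)
wx = Node("wx", lambda a, c: a * c, (w, x))
y = Node("y", lambda a, c: a + c, (wx, b))

# Running the graph: data flows from x and w into wx, then into y.
result = evaluate(y)
print(result)
```

Separating the description of the computation from its execution is what lets a framework decide where and how each operation runs, which is part of why the environment the graph runs in matters so much.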
One of the difficulties frequently encountered with TensorFlow is configuring and maintaining a development environment that matches a production environment. Typically, a team of data scientists works on the same production cluster, but each data scientist requires a particular configuration and set of Python libraries. Setting up unique requirements on the same cluster involves a series of complicated, time-intensive tasks for both operators and data scientists.
Data scientists work with data locally to generate models, and then move those models and the associated data to a test or production environment. It can be tedious getting data or models from a local source to a test or production cluster. Many times a data scientist will waste valuable time copying files from node to node to container.
Installing, maintaining, scaling, and updating TensorFlow presents one set of challenges, and working with TensorFlow presents another. Mesosphere DC/OS eases the first burden by providing a point-and-click installation for TensorFlow, so development and staging environments can be set up similarly enough to production that migrating code and data back and forth is trivial.
Data scientists frequently rely on an Integrated Development Environment (IDE) to write TensorFlow (or Python or R) code. Jupyter Notebooks are the IDE of choice for many data scientists because they allow users to create and share documents that contain live code, equations, visualizations, and narrative text.
However, connecting and combining billions of records from multiple data sources and performing complex analyses is not a straightforward task, and there are still some holes in the Jupyter workflow. This is a problem felt acutely by Two Sigma, an investment company that applies the scientific method to investment management — combining massive amounts of data, world-class computing power, and financial expertise to develop sophisticated trading models.
Two Sigma open sourced BeakerX, a collection of kernels and extensions to the Jupyter IDE, to streamline data science work. With it, a notebook can connect to multiple data sources to manipulate, explore, and model that data, and eventually deploy a working model into production.
Mesosphere has been working with Two Sigma for many years, primarily on building solutions for managing Two Sigma’s clustered compute environments. David Palaitis, senior vice president at Two Sigma, describes how Apache Mesos facilitates data science in their environment, saying that “it’s nice to be working further up the stack now, extending Mesosphere DC/OS with an Integrated Development Environment (IDE) for data scientists. The IDE, built on Jupyter and BeakerX, abstracts the complexity of managing distributed compute. Beyond that, it makes it easier to manage the rich data analysis and machine learning software stacks required to practice data science at scale today. We’re looking forward to releasing more functionality for dataset discovery, notebook sharing and Spark cluster management into the open source BeakerX product later this year.”
BeakerX provides JVM support, interactive plots, tables, forms, publishing, and more. A real-world data science problem potentially connects and combines billions of records from multiple data sources. A BeakerX notebook could connect to those data sources to manipulate, explore, and model that data. The final step involves pushing a real model into production. By using BeakerX on DC/OS, data scientists are able to work directly in a production environment, simplifying the test-to-production pipeline. For those interested in learning more, check out this in-depth tutorial on performing data science at scale, with example notebooks and visualizations.
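As a rough illustration of the "connect and combine" step described above, the following sketch joins records from two different kinds of sources in a single notebook-style cell. It uses only the Python standard library rather than any BeakerX-specific API, and the `trades` table, the price feed, and all sample values are made up for the example.

```python
import csv
import io
import sqlite3

# Source 1 (hypothetical): a table of trades in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?)",
    [("AAA", 10), ("BBB", 5), ("AAA", 7)],
)

# Source 2 (hypothetical): a CSV price feed, inlined here as a string.
price_csv = io.StringIO("symbol,price\nAAA,2.5\nBBB,4.0\n")
prices = {row["symbol"]: float(row["price"]) for row in csv.DictReader(price_csv)}

# Combine: aggregate inside the database, then join against the CSV data.
totals = {}
for symbol, quantity in conn.execute(
    "SELECT symbol, SUM(quantity) FROM trades GROUP BY symbol"
):
    totals[symbol] = quantity * prices[symbol]

conn.close()
print(totals)
```

At real scale the aggregation would be pushed down to the data sources or to a cluster rather than done in a Python loop, but the shape of the workflow (connect, combine, then explore or model) is the same.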
All pre-built DC/OS packages are easy to install, massively scalable, and consistent to deploy. The addition of BeakerX to the package catalog allows data scientists to work directly in a production environment, simplifying the test-to-production pipeline.
This article was contributed by Mesosphere, a sponsor of The New Stack.
Feature image via Pixabay.