Use Kubernetes to Speed Machine Learning Development

As industries shift to a microservices approach of deploying applications in containers, data scientists can reap the benefits. Data scientists often depend on specific frameworks and operating systems that conflict with the requirements of a production system, which has led to many clashes between IT and R&D departments: IT is not going to change the production OS to accommodate a model that depends on a framework that won't run on RHEL 7.2.
Containers allow a data scientist to construct self-contained environments that package up the necessary dependencies and logic. This also gives the data scientist a seat at the table as discussions move from DevOps to DataOps. As data arrives and is parsed for value, containers that perform specific tasks can be staged along the way, creating a machine learning workflow on new incoming data that was not possible just a few years ago.
Data scientists can deploy multiple containers to account for adjustments in the data or variations in their models. This allows an organization to run models in parallel, evaluate them, and choose the most valuable one based on how it performs on new real-time data rather than on how well it was optimized against historical data.
For this example, I installed Docker and Kubernetes with kubeadm on AWS EC2 instances to create a two-node Kubernetes cluster running CentOS 7.5:
[justin@ip-10-0-0-105 ~]$ rpm --query centos-release
centos-release-7-5.1804.el7.centos.2.x86_64
[justin@ip-10-0-0-105 ~]$ python -V
Python 2.7.5
[justin@ip-10-0-0-105 ~]$ kubectl cluster-info
Kubernetes master is running at https://10.0.0.105:6443
KubeDNS is running at https://10.0.0.105:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
[justin@ip-10-0-0-105 ~]$ kubectl get nodes
NAME                         STATUS    ROLES     AGE       VERSION
ip-10-0-0-149.ec2.internal   Ready     <none>    5h        v1.11.3
ip-10-0-0-236.ec2.internal   Ready     master    5h        v1.11.3
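For reference, the kubeadm flow looks roughly like the following. This is a sketch, not the exact commands used to build this cluster; repository setup and the generated join token are omitted:

# On both nodes (assumes the Kubernetes yum repo is already configured):
sudo yum install -y docker kubelet kubeadm kubectl
sudo systemctl enable --now docker kubelet

# On the master only: initialize the control plane, then install a pod
# network add-on (the CIDR below is an illustrative assumption).
sudo kubeadm init --pod-network-cidr=10.244.0.0/16

# On the worker: run the `kubeadm join <master-ip>:6443 --token ...`
# command that kubeadm init prints on the master.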
Data scientists typically develop, train, test and optimize their models in an R&D environment that can be configured to meet their needs. Here is a TensorFlow model I wrote in a sandbox that applies a recurrent neural network (RNN) to simulated time series data.
import pandas as pd
import numpy as np
import os
import random
import shutil
import tensorflow as tf
import tensorflow.contrib.metrics as metrics
import tensorflow.contrib.rnn as rnn

def main():
    random.seed(111)
    rng = pd.date_range(start='1/01/2000', end='9/21/2018')
    ts = pd.Series(np.random.uniform(-10, 10, size=len(rng)), rng).cumsum()
    TS = np.array(ts)

    num_periods = 100
    f_horizon = 1  # forecast horizon, one period into the future

    x_data = TS[:(len(TS) - (len(TS) % num_periods))]
    x_batches = x_data.reshape(-1, 100, 1)
    y_data = TS[1:(len(TS) - (len(TS) % num_periods)) + f_horizon]
    y_batches = y_data.reshape(-1, 100, 1)

    # number of periods per vector we are using to predict one period ahead
    num_periods = 100
    inputs = 1    # number of vectors submitted
    hidden = 100  # number of neurons we will recursively work through
    output = 1    # number of output vectors

    # create variable objects
    X = tf.placeholder(tf.float32, [None, num_periods, inputs])
    y = tf.placeholder(tf.float32, [None, num_periods, output])

    # create our RNN
    model = tf.nn.rnn_cell.BasicRNNCell(num_units=hidden, activation=tf.nn.relu)

    # choose dynamic over static
    rnn_output, states = tf.nn.dynamic_rnn(model, X, dtype=tf.float32)

    learning_rate = 0.001

    # change the form into a tensor
    stacked_rnn_output = tf.reshape(rnn_output, [-1, hidden])
    stacked_outputs = tf.layers.dense(stacked_rnn_output, output)
    outputs = tf.reshape(stacked_outputs, [-1, num_periods, output])

    # define the cost function which evaluates the quality of our model
    loss = tf.reduce_sum(tf.square(outputs - y))

    # gradient descent method
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

    # train the result of the application of the cost function
    training_op = optimizer.minimize(loss)

    saver = tf.train.Saver()  # we are going to save the model
    DIR = "/tmp/model/"

    init = tf.global_variables_initializer()
    epochs = 1000

    with tf.Session() as sess:
        init.run()
        model_performance = []
        for ep in range(epochs):
            sess.run(training_op, feed_dict={X: x_batches, y: y_batches})
            mse = loss.eval(feed_dict={X: x_batches, y: y_batches})
            model_performance.append((ep, mse))
        saver.save(sess, os.path.join(DIR, "model"), global_step=epochs)

if __name__ == "__main__":
    main()
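The script trains for 1,000 epochs, tracks the MSE per epoch, and checkpoints the graph to /tmp/model/. To sanity-check the fit before containerizing it, a prediction pass can be run in the same session; here is a minimal sketch, assuming these lines are appended inside the with tf.Session() block after the training loop:

        # Sketch: run the trained graph forward on the training batches to
        # get one-period-ahead predictions (appended inside the session).
        y_pred = sess.run(outputs, feed_dict={X: x_batches})
        print("final MSE:", model_performance[-1][1])
        print("last predicted value:", y_pred[-1, -1, 0])
        print("last observed value:", y_batches[-1, -1, 0])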
I built this in the R&D environment, but now I want to move it to the production environment. I will use Docker to build an image that holds my model, then deploy that image with Kubernetes.
First, I will create a Dockerfile that constructs an image based on Ubuntu and installs the dependencies and packages my model needs to function.
[justin@ip-10-0-0-105 ~]$ mkdir dockbuild
[justin@ip-10-0-0-105 ~]$ cd dockbuild/
[justin@ip-10-0-0-105 dockbuild]$ vi Dockerfile

FROM ubuntu:16.04

RUN apt-get update && apt-get install -y \
        build-essential \
        curl \
        git \
        libfreetype6-dev \
        libpng12-dev \
        libzmq3-dev \
        mlocate \
        pkg-config \
        python-dev \
        python-numpy \
        python-pip \
        software-properties-common \
        swig \
        zip \
        zlib1g-dev \
        libcurl3-dev \
        openjdk-8-jdk \
        openjdk-8-jre-headless \
        wget \
        && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" \
    | tee /etc/apt/sources.list.d/tensorflow-serving.list

RUN curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg \
    | apt-key add -

RUN apt-get update && apt-get install -y \
    tensorflow-model-server

RUN pip install --upgrade pip
RUN pip install pandas tensorflow tensorflow-serving-api

CMD ["/bin/bash"]
I am going to use the TensorFlow Serving API to execute and save my model within a Docker container. Next, build the image and run it:
[justin@ip-10-0-0-105 dockbuild]$ sudo docker build -t justin-tf_serving .
[justin@ip-10-0-0-105 dockbuild]$ sudo docker run --name=rnn_model_1 -it justin-tf_serving
root@1c48c2df62f8:/# cd
root@1c48c2df62f8:~# vi rnn_model_1.py
root@1c48c2df62f8:~# python rnn_model_1.py
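One caveat worth noting: tensorflow_model_server loads models in the SavedModel format from numbered version subdirectories, while the script above writes raw tf.train.Saver checkpoints. A minimal export step might look like the following sketch, assuming it is added at the end of the training session; the "X" and "outputs" signature keys are my own naming choices, not fixed by the API:

        # Sketch: export a SavedModel so tensorflow_model_server can serve
        # it. Assumes this runs inside the training session above; "1" is
        # the numeric version subdirectory under the model base path.
        export_dir = os.path.join(DIR, "1")
        tf.saved_model.simple_save(
            sess,                           # live session with trained weights
            export_dir,
            inputs={"X": X},                # placeholder from the script above
            outputs={"outputs": outputs})   # prediction tensor from the script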
Copy the model in and run the Python script. The model parameters will be saved under the /tmp/ folder within the container. To exit the container while keeping it running in the background, press Ctrl+P then Ctrl+Q. I need to persist the changes I made to the running rnn_model_1 container so my model data permanently remains. Retrieve the container ID and commit the changes into a new image called tf_kube1.
[justin@ip-10-0-0-105 dockbuild]$ sudo docker ps -a
[justin@ip-10-0-0-105 dockbuild]$ sudo docker commit 07845b9c7ec5 tf_kube1
[justin@ip-10-0-0-105 dockbuild]$ sudo docker stop rnn_model_1
Kubernetes can pull images from a private or local registry, but for the purposes of this example, we will push our new image to Docker Hub and pull it from there. Log in with your username and password.
[justin@ip-10-0-0-105 dockbuild]$ sudo docker login --username=jbrandenburg
Password:
Login Succeeded
[justin@ip-10-0-0-105 dockbuild]$ sudo docker tag tf_kube1 jbrandenburg/kube-example
[justin@ip-10-0-0-105 dockbuild]$ sudo docker push jbrandenburg/kube-example
Once our image is on Docker Hub, we need to specify how Kubernetes should use it on our cluster. We do this via a .yaml file that sets up a deployment of containers and exposes them as a service.
[justin@ip-10-0-0-105 dockbuild]$ vi kube_example.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tfrnn-deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: tfrnn-server
    spec:
      containers:
      - name: rnn-model-1
        image: jbrandenburg/kube-example
        command:
        - /bin/sh
        args:
        - -c
        - tensorflow_model_server --model_name=model --model_base_path=/tmp/model
        ports:
        - containerPort: 8500
---
apiVersion: v1
kind: Service
metadata:
  labels:
    run: tfrnn-service
  name: tfrnn-service
spec:
  ports:
  - port: 8500
    targetPort: 8500
  selector:
    app: tfrnn-server
  type: LoadBalancer

Create the Kubernetes objects:

[justin@ip-10-0-0-105 dockbuild]$ kubectl get nodes
[justin@ip-10-0-0-105 dockbuild]$ kubectl create -f kube_example.yaml
deployment.extensions/tfrnn-deployment created
service/tfrnn-service created
[justin@ip-10-0-0-105 dockbuild]$ kubectl get deployments
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
tfrnn-deployment   3         3         3            3           23s
[justin@ip-10-0-0-105 dockbuild]$ kubectl get pods
NAME                               READY     STATUS    RESTARTS   AGE
tfrnn-deployment-868f55dd5-7s4tw   1/1       Running   0          42s
tfrnn-deployment-868f55dd5-mnvlb   1/1       Running   0          42s
tfrnn-deployment-868f55dd5-qf5j6   1/1       Running   0          42s
[justin@ip-10-0-0-105 dockbuild]$ kubectl get services
NAME            TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
kubernetes      ClusterIP      10.96.0.1      <none>        443/TCP          1m
tfrnn-service   LoadBalancer   10.106.86.19   <pending>     8500:31723/TCP   1m
[justin@ip-10-0-0-105 dockbuild]$ kubectl describe service tfrnn-service
Name:                     tfrnn-service
Namespace:                default
Labels:                   run=tfrnn-service
Annotations:              <none>
Selector:                 app=tfrnn-server
Type:                     LoadBalancer
IP:                       10.106.86.19
Port:                     <unset>  8500/TCP
TargetPort:               8500/TCP
NodePort:                 <unset>  31723/TCP
Endpoints:
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

Log in to one of the three pods that we just instantiated:

[justin@ip-10-0-0-105 dockbuild]$ kubectl exec -it tfrnn-deployment-868f55dd5-7s4tw -- /bin/bash
root@tfrnn-deployment-868f55dd5-7s4tw:/# ls /tmp/model/
checkpoint  model-1000.data-00000-of-00001  model-1000.index  model-1000.meta
Kubernetes is now running the Docker image that contains our trained TensorFlow model. We can push new data through the model and have it evaluated to give us results. We could also have built a second image with adjusted hyperparameters, so that our pods run Model A and Model B side by side to compare results.
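To push new data through the served model, the tensorflow-serving-api package installed in the image can be used as a gRPC client. Here is a sketch, not a definitive client: the host is the service's cluster IP from the kubectl get services output above, while the signature and tensor names assume the simple_save export sketched earlier.

# Hypothetical gRPC client for the tfrnn-service endpoint.
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

channel = grpc.insecure_channel("10.106.86.19:8500")  # cluster IP of tfrnn-service
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "model"  # matches --model_name=model in the deployment
request.model_spec.signature_name = "serving_default"  # default key from simple_save

# one batch of 100 new periods, shaped the way the model was trained
new_data = np.random.uniform(-10, 10, size=(1, 100, 1)).astype(np.float32)
request.inputs["X"].CopyFrom(tf.make_tensor_proto(new_data))

response = stub.Predict(request, timeout=10.0)
print(response.outputs["outputs"])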
Our models have everything they need to run inside their containers, and the containers are configured for the production environment. Kubernetes lets us specify resource requests and limits to make efficient use of the compute allocated to our models, and it will tell us if a container is not performing as it should.
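For instance, resource requests and limits could be added to the container spec in kube_example.yaml; the values below are illustrative assumptions, not numbers tuned for this model:

# added under the rnn-model-1 container in kube_example.yaml
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "2"
    memory: "2Gi"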
As recently as two years ago, once I had performed my analysis and gained insight from data, I was never able to take the next step and deploy that insight. I would write a report, send an email, or present some slides, but my value was limited to whatever decision makers did with it. Transferring my workflow logic and model into a production-ready application required the approval of many people and the dedication of a software developer. In a dynamic industry, this lag gave the data time to change, which would cause the model results to become less meaningful.
With developments in containers and Kubernetes, this doesn’t need to be the case any longer. The value of data science is determined by the insight it gives into data. This value can only increase as the ability to solve challenges in real time becomes more available.
Feature image via Pixabay.