
Use Kubernetes to Speed Machine Learning Development

12 Oct 2018 8:25am, by

Justin Brandenburg
Justin Brandenburg is a Data Scientist in the MapR Professional Services group. Justin has experience in a number of data areas ranging from counter-narcotics to cyber intrusion analysis. In past projects, he has utilized machine learning, econometrics, graph analytics and agent-based modeling to fulfill the customer needs. He has an undergraduate degree in Economics from Va Tech, a Masters in Economics from Johns Hopkins University and a Masters in Computational Social Science from George Mason University.

As industries shift to a microservices approach of deploying applications using containers, data scientists can reap the benefits. Data scientists often depend on specific frameworks and operating systems that conflict with the requirements of a production system, which has led to many clashes between IT and R&D departments. IT is not going to change the OS to accommodate a model whose framework won’t run on RHEL 7.2.

Containers allow a data scientist to construct self-contained environments that package up necessary dependencies and logic. This also allows the data scientist a seat at the table as discussions move from DevOps to DataOps. As data arrives and is parsed for value, containers that perform specific tasks can be staged along the way, creating a machine learning workflow on new incoming data that was not possible just a few years ago.

Data scientists can deploy multiple containers to account for adjustments in the data or variations in their model. This allows an organization to run models in parallel, evaluate them on new real-time data rather than on the historical data they were optimized on, and then choose the one that proves more valuable.
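As a rough illustration of that comparison, the sketch below scores two candidate models on the same batch of newly arrived observations and picks the one with the lower mean squared error. The names model_a, model_b and pick_best, along with all the numbers, are hypothetical stand-ins for predictions returned by two containers; nothing here is specific to Kubernetes.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between observed and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return total / float(len(actual))


def pick_best(actual, predictions_by_model):
    """Return (name, mse) for the model with the lowest error on new data."""
    scores = {name: mean_squared_error(actual, preds)
              for name, preds in predictions_by_model.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Hypothetical values: observations that arrived after both models were trained.
new_data = [10.0, 12.0, 11.0, 13.0]
predictions = {
    "model_a": [9.0, 12.5, 11.0, 14.0],    # e.g. returned by one container
    "model_b": [10.5, 11.5, 11.5, 12.5],   # e.g. returned by another
}
best_name, best_mse = pick_best(new_data, predictions)
print(best_name, best_mse)
```

Mean squared error is only one possible yardstick; any metric that matters to the business could be swapped in, as long as every model is scored on the same new data.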

For this example, I installed Docker and Kubernetes using kubeadm on AWS EC2 instances to create a two-node Kubernetes cluster running CentOS 7.5:

[justin@ip-10-0-0-105 ~]$ rpm --query centos-release
[justin@ip-10-0-0-105 ~]$ python -V
Python 2.7.5
[justin@ip-10-0-0-105 ~]$ kubectl cluster-info
Kubernetes master is running at
KubeDNS is running at
[justin@ip-10-0-0-105 ~]$ kubectl get nodes
NAME                         STATUS   ROLES    AGE   VERSION
ip-10-0-0-149.ec2.internal   Ready    <none>   5h    v1.11.3
ip-10-0-0-236.ec2.internal   Ready    master   5h    v1.11.3

Data scientists typically develop, train, test and optimize their models in an R&D environment that can be configured to meet their needs. Here is a TensorFlow model I wrote in a sandbox that applies a recurrent neural network to simulated time series data.

import pandas as pd
import numpy as np
import os
import random
import shutil
import tensorflow as tf
import tensorflow.contrib.metrics as metrics
import tensorflow.contrib.rnn as rnn

def main():
    rng = pd.date_range(start='1/01/2000', end='9/21/2018')
    ts = pd.Series(np.random.uniform(-10, 10, size=len(rng)), rng).cumsum()
    TS = np.array(ts)
    num_periods = 100

    f_horizon = 1  #forecast horizon, one period into the future
    x_data = TS[:(len(TS)-(len(TS) % num_periods))]
    x_batches = x_data.reshape(-1, 100, 1)
    y_data = TS[1:(len(TS)-(len(TS) % num_periods))+f_horizon]
    y_batches = y_data.reshape(-1, 100, 1)

    #number of periods per vector we are using to predict one period ahead

    num_periods = 100                 

    inputs = 1                #number of vectors submitted
    hidden = 100              #number of neurons we will recursively work through
    output = 1                #number of output vectors

    #create variable objects

    X = tf.placeholder(tf.float32, [None, num_periods, inputs])      
    y = tf.placeholder(tf.float32, [None, num_periods, output])

    #create our RNN

    model = tf.nn.rnn_cell.BasicRNNCell(num_units=hidden, activation=tf.nn.relu)   

    #choose dynamic over static

    rnn_output, states = tf.nn.dynamic_rnn(model, X, dtype=tf.float32)                     
    learning_rate = 0.001 

    #change the form into a tensor

    stacked_rnn_output = tf.reshape(rnn_output, [-1, hidden])          
    stacked_outputs = tf.layers.dense(stacked_rnn_output, output)   
    outputs = tf.reshape(stacked_outputs, [-1, num_periods, output]) 

    #define the cost function which evaluates the quality of our model

    loss = tf.reduce_sum(tf.square(outputs - y))            

    #gradient descent method

    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)              

    #train the result of the application of the cost_function

    training_op = optimizer.minimize(loss)                                   
    saver = tf.train.Saver()   #we are going to save the model
    init = tf.global_variables_initializer()
    epochs = 1000       

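    #directory for the saved model files; /tmp/model matches the
    #model_base_path used when the model is deployed later on
    DIR = "/tmp/model"
    if not os.path.isdir(DIR):
        os.makedirs(DIR)

    with tf.Session() as sess:
        sess.run(init)    #initialize the variables before training
        model_performance = []
        for ep in range(epochs):
            sess.run(training_op, feed_dict={X: x_batches, y: y_batches})
            mse = loss.eval(feed_dict={X: x_batches, y: y_batches})
            model_performance.append((ep, mse))
        #save once after training; global_step appends the epoch count to the file name
        saver.save(sess, os.path.join(DIR, "model"), global_step=epochs)

if __name__ == "__main__":
    main()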
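The slicing at the top of main() pairs each 100-step input window with the same window shifted one step into the future. The sketch below shows that windowing in plain Python, without NumPy or TensorFlow, so the shapes are easy to inspect. The helper name make_windows is hypothetical, and unlike the script it also leaves room for the shifted targets when the series length is an exact multiple of the window size.

```python
def make_windows(series, num_periods=100, f_horizon=1):
    """Split a series into input windows and targets shifted f_horizon ahead.

    Trailing values that do not fill a whole window are dropped.
    """
    # Largest multiple of num_periods that still leaves f_horizon values
    # after it for the shifted targets.
    usable = max(0, (len(series) - f_horizon) // num_periods) * num_periods
    x = series[:usable]
    y = series[f_horizon:usable + f_horizon]
    starts = range(0, usable, num_periods)
    x_windows = [x[i:i + num_periods] for i in starts]
    y_windows = [y[i:i + num_periods] for i in starts]
    return x_windows, y_windows


# Tiny example: 10 values, windows of 3, predicting 1 step ahead.
x_small, y_small = make_windows(list(range(10)), num_periods=3, f_horizon=1)
print(x_small)
print(y_small)

# Same length as the simulated daily series (2000-01-01 through 2018-09-21).
x_full, y_full = make_windows(list(range(6839)))
print(len(x_full), len(x_full[0]))
```

Each target window starts one step after its input window, which is what lets the network learn a one-period-ahead forecast at every position in the sequence.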


I built this in the R&D environment, but now I want to move it over to the production environment. I will use Docker to build an image that I can then put my model into and deploy using Kubernetes.

First I will create a Dockerfile that will allow me to construct an image with an Ubuntu OS and install the dependencies and packages my model needs to function.

[justin@ip-10-0-0-105 ~]$ mkdir dockbuild
[justin@ip-10-0-0-105 ~]$ cd dockbuild/
[justin@ip-10-0-0-105 dockbuild]$ vi Dockerfile

FROM ubuntu:16.04
RUN apt-get update && apt-get install -y \
 build-essential \
 curl \
 git \
 libfreetype6-dev \
 libpng12-dev \
 libzmq3-dev \
 mlocate \
 pkg-config \
 python-dev \
 python-numpy \
 python-pip \
 software-properties-common \
 swig \
 zip \
 zlib1g-dev \
 libcurl3-dev \
 openjdk-8-jre-headless \
 wget \
 && \
 apt-get clean && \
 rm -rf /var/lib/apt/lists/*

RUN echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" \
 | tee /etc/apt/sources.list.d/tensorflow-serving.list

RUN curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg \
 | apt-key add -

# Assumption: the package list for this step is missing from the listing.
# tensorflow-model-server is the package provided by the apt source added above.
RUN apt-get update && apt-get install -y \
 tensorflow-model-server

RUN pip install --upgrade pip
RUN pip install pandas tensorflow tensorflow-serving-api
CMD ["/bin/bash"]

I am going to use the TensorFlow Serving API to execute and save my model within a Docker container. Next, build the image and then run it:

[justin@ip-10-0-0-105 dockbuild]$ sudo docker build -t justin-tf_serving .
[justin@ip-10-0-0-105 dockbuild]$ sudo docker run --name=rnn_model_1 -it justin-tf_serving

root@1c48c2df62f8:/# cd
root@1c48c2df62f8:~# vi rnn_model_1.py
root@1c48c2df62f8:~# python rnn_model_1.py

Copy the model script into the container and then run it with Python. The model parameters will be saved under /tmp/ within the container. To exit the container and keep it running in the background, press Ctrl+P followed by Ctrl+Q. The changes I made inside the running container must be persisted for my model data to remain, so retrieve the container ID and commit the changes into a new image called tf_kube1.

[justin@ip-10-0-0-105 dockbuild]$ sudo docker ps -a
[justin@ip-10-0-0-105 dockbuild]$ sudo docker commit 07845b9c7ec5 tf_kube1
[justin@ip-10-0-0-105 dockbuild]$ sudo docker stop rnn_model_1

Kubernetes allows you to pull images from a private or local image hub, but for the purpose of this example, we will push and then pull our new image from Docker Hub. Log in with your username and password.

[justin@ip-10-0-0-105 dockbuild]$ sudo docker login --username=jbrandenburg
Login Succeeded
[justin@ip-10-0-0-105 dockbuild]$ sudo docker tag tf_kube1 jbrandenburg/kube-example
[justin@ip-10-0-0-105 dockbuild]$ sudo docker push jbrandenburg/kube-example

Once our image is on Docker Hub, we need to specify how we want Kubernetes to use the image on our cluster. We do this via a .yaml file. We are setting up a deployment of containers that will also be running as a service.

[justin@ip-10-0-0-105 dockbuild]$ vi kube_example.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tfrnn-deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: tfrnn-server
    spec:
      containers:
      - name: rnn-model-1
        image: jbrandenburg/kube-example
        command:
        - /bin/sh
        args:
        - -c
        - tensorflow_model_server --model_name=model --model_base_path=/tmp/model
        ports:
        - containerPort: 8500
---
apiVersion: v1
kind: Service
metadata:
  labels:
    run: tfrnn-service
  name: tfrnn-service
spec:
  ports:
  - port: 8500
    targetPort: 8500
  selector:
    app: tfrnn-server
  type: LoadBalancer

Create the Kubernetes objects:

[justin@ip-10-0-0-105 dockbuild]$ kubectl get nodes
[justin@ip-10-0-0-105 dockbuild]$ kubectl create -f kube_example.yaml
deployment.extensions/tfrnn-deployment created
service/tfrnn-service created
[justin@ip-10-0-0-105 dockbuild]$ kubectl get deployments
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
tfrnn-deployment   3         3         3            3           23s
[justin@ip-10-0-0-105 dockbuild]$ kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
tfrnn-deployment-868f55dd5-7s4tw   1/1     Running   0          42s
tfrnn-deployment-868f55dd5-mnvlb   1/1     Running   0          42s
tfrnn-deployment-868f55dd5-qf5j6   1/1     Running   0          42s
[justin@ip-10-0-0-105 dockbuild]$ kubectl get services
NAME            TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
kubernetes      ClusterIP                   <none>        443/TCP          1m
tfrnn-service   LoadBalancer                <pending>     8500:31723/TCP   1m
[justin@ip-10-0-0-105 dockbuild]$ kubectl describe service tfrnn-service
Name:                     tfrnn-service
Namespace:                default
Labels:                   run=tfrnn-service
Annotations:              <none>
Selector:                 app=tfrnn-server
Type:                     LoadBalancer
IP:
Port:                     <unset>  8500/TCP
TargetPort:               8500/TCP
NodePort:                 <unset>  31723/TCP
Endpoints:
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

Log in to one of the three pods that we just instantiated:

[justin@ip-10-0-0-105 dockbuild]$ kubectl exec -it tfrnn-deployment-868f55dd5-7s4tw -- /bin/bash
root@tfrnn-deployment-868f55dd5-7s4tw:/# ls /tmp/model/
checkpoint  model-1000.data-00000-of-00001  model-1000.index  model-1000.meta
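The same check can be scripted rather than done by eye. The sketch below lists the global steps in a directory for which the .index, .meta and .data files are all present, using only the standard library. The helper name complete_checkpoints and the decision to treat all three file types as required are assumptions made for this example; the demo recreates the four file names from the listing above as empty files in a temporary directory.

```python
import os
import re
import tempfile


def complete_checkpoints(model_dir, name="model"):
    """Return sorted global steps whose index, meta and data files all exist."""
    names = set(os.listdir(model_dir))
    pattern = re.compile(r"^%s-(\d+)\.index$" % re.escape(name))
    steps = []
    for fname in names:
        match = pattern.match(fname)
        if not match:
            continue
        stem = "%s-%s" % (name, match.group(1))
        has_meta = (stem + ".meta") in names
        has_data = any(n.startswith(stem + ".data-") for n in names)
        if has_meta and has_data:
            steps.append(int(match.group(1)))
    return sorted(steps)


# Demo with empty placeholder files named like the ones in the pod.
demo_dir = tempfile.mkdtemp()
for fname in ["checkpoint",
              "model-1000.data-00000-of-00001",
              "model-1000.index",
              "model-1000.meta"]:
    open(os.path.join(demo_dir, fname), "w").close()

steps = complete_checkpoints(demo_dir)
print(steps)
```

Inside a pod, the function would be pointed at /tmp/model instead of the temporary directory.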

Kubernetes is now running our Docker image, which contains the trained TensorFlow model we created. Now we can push new data through the model, and the model can evaluate this data and give us results. We could also have created a second image with adjusted model hyperparameters, so that our pods run Model A and Model B side by side to compare results.
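One way to set up such a side-by-side run is to generate one deployment manifest per model image. The sketch below builds two manifests as Python dictionaries and prints them as JSON, which kubectl create -f accepts as well as YAML. The names tfrnn-a and tfrnn-b and the second image jbrandenburg/kube-example-b are hypothetical; the apiVersion follows the deployment.extensions group reported by kubectl on this 1.11 cluster.

```python
import json


def deployment_manifest(name, image, replicas=3, port=8500):
    """Build a Deployment dict with the same structure as kube_example.yaml."""
    serve_cmd = ("tensorflow_model_server --model_name=model "
                 "--model_base_path=/tmp/model")
    return {
        "apiVersion": "extensions/v1beta1",
        "kind": "Deployment",
        "metadata": {"name": name + "-deployment"},
        "spec": {
            "replicas": replicas,
            "template": {
                # A distinct label per model lets each Service select its own pods.
                "metadata": {"labels": {"app": name + "-server"}},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "command": ["/bin/sh"],
                        "args": ["-c", serve_cmd],
                        "ports": [{"containerPort": port}],
                    }]
                },
            },
        },
    }


# Hypothetical image for Model B; only the first image appears in the article.
models = {
    "tfrnn-a": "jbrandenburg/kube-example",
    "tfrnn-b": "jbrandenburg/kube-example-b",
}
manifests = {name: deployment_manifest(name, image)
             for name, image in models.items()}

for name in sorted(manifests):
    print(json.dumps(manifests[name], indent=2, sort_keys=True))
```

Each printed document could be saved to its own file and created like kube_example.yaml, with a matching Service per model so that results from the two can be collected separately.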

Our models have all they need to run in their containers. The containers are configured to run in the production environment. Kubernetes will let us specify resources to improve efficiency in the compute allocated to our models and will let us know if a container is not performing as it should.
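As a sketch of what specifying resources and health checks can look like, the function below adds resource requests, limits and a TCP liveness probe to a container entry from a deployment manifest. The field names (resources, requests, limits, livenessProbe, tcpSocket) are standard Kubernetes container fields, but every value shown is a hypothetical placeholder that would need tuning for a real model, and the helper name is made up for this example.

```python
import copy


def with_resources_and_probe(container, port=8500):
    """Return a copy of a container spec with resource bounds and a liveness probe."""
    updated = copy.deepcopy(container)
    # Hypothetical values: requests guide scheduling, limits cap usage.
    updated["resources"] = {
        "requests": {"cpu": "500m", "memory": "512Mi"},
        "limits": {"cpu": "1", "memory": "1Gi"},
    }
    # Kubernetes restarts the container if the serving port stops accepting
    # TCP connections; the timings here are placeholders.
    updated["livenessProbe"] = {
        "tcpSocket": {"port": port},
        "initialDelaySeconds": 15,
        "periodSeconds": 20,
    }
    return updated


# Container entry matching the one in kube_example.yaml.
base = {
    "name": "rnn-model-1",
    "image": "jbrandenburg/kube-example",
    "ports": [{"containerPort": 8500}],
}
tuned = with_resources_and_probe(base)
print(sorted(tuned.keys()))
```

A TCP probe only confirms that the port is open; checking that predictions are still sensible would need an application-level check on top of it.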

As recently as two years ago, once I had performed my analysis and gained insight from data, I was never able to take the next step and deploy this insight. I would write a report, send an email, or present some slides, but my value was limited to what decision makers would do with it. Transferring my workflow logic and model into a production-ready application required the approval of many people and the dedication of a software developer. In a dynamic industry, this lag could allow the data to change, which would make the model results less meaningful.

With developments in containers and Kubernetes, this doesn’t need to be the case any longer. The value of data science is determined by the insight it gives into data. This value can only increase as the ability to solve challenges in real time becomes more available.

Feature image via Pixabay.
