Hybrid MLOps with HashiCorp Nomad

Data scientists are highly sought-after professionals who may not want to build and set up their own infrastructure in addition to their primary job duties. Put too much on their plates, and you might lose them.
That’s why the emerging machine learning operations (MLOps) paradigm is helpful for organizations hoping to reduce data scientist turnover: it aims to automate the infrastructure support and CI/CD surrounding ML models end to end. Greater productivity means organizations can establish team requirements for effective data practice and stop searching for unicorns who do everything.
Once a capable team is in place, organizations need tools and workflows to drive value with MLOps implementations. A simple and flexible workload scheduler such as HashiCorp’s Nomad might seem like an odd mention amid a growing crowd in the data science tooling space. However, Nomad’s intrinsic flexibility makes it an asset for building ML pipelines in complex, hybrid environments.
I want to demonstrate where Nomad can fit amid all the moving parts of an automated ML pipeline so that the decision whether to use Nomad is easier to make.
To do this, we need an initial project (represented in Figure 1) that we can build as a proof of concept: a pipeline for training and deploying machine learning model versions.
Building an Automated MLOps Pipeline

Figure 1: a pipeline for training and deploying model versions.
The subtasks of this project are:
- Pull data from an external source.
- Transform the data into a usable form.
- Build a model in accordance with our overall ML goal.
- Train the model.
- Test the model against a subset of our data to perform a pseudo integration test on the model’s effectiveness.
- Store the versioned model for future reference and later use.
The end goal of this project is to build a machine learning model and save that versioned model into a cloud bucket so we can leverage it from other sources in the future. The build-learn-adjust circle at the far right of Figure 1 represents our desire to iterate on this design as we move forward with an end-to-end solution.
Cloud platforms and on-premises frameworks offer some model storage solutions, so there’s nothing functionally new in what I’m presenting here. However, if you’ve ever used these solutions, you know that they don’t interoperate particularly well. So, if you manage a hybrid cloud MLOps ecosystem, learning the particular ins and outs of each model storage option can turn into a surprisingly daunting task. With Nomad, however, we get the benefit of orchestrating a single pipeline across environments out of the box.
Nomad’s reference architecture is a helpful guide for cluster system requirements if you decide to run with this on your own, but know that I used the smallest recommended n1 series VMs in Google Cloud (GCP). It’s up to you to either point and click the cluster into existence or use a provisioning tool such as the Terraform configuration here.
The model used doesn’t matter because we are ultimately concerned with the operations required to deploy it. With that said, it is important to note that I used jsforce, TensorFlow.js and the Node.js runtime to build the feature engineering, model training and model code. Then, I deployed these components into a Docker container. (You can read more about Docker as it pertains to ML modeling and testing here.)
The Nomad Config
In Nomad, users define how applications should be deployed with a declarative specification: the Nomad job. A job represents the desired application state defined in JSON or HCL format. In a Nomad job, tasks are the smallest units of deployment. A task could be a Docker container, a Java application or a batch-processing job. Each of the tasks runs in an execution environment known as a task driver. A group defines a set of tasks that must be run on the same node or VM. An allocation is an instantiation of a group running on a node. If there need to be three instances of a task group specified in the job, Nomad will create three allocations and place those accordingly.
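To make that hierarchy concrete, here is a minimal sketch of a batch job containing one group with one Docker task. The names and image are illustrative only and are not part of the pipeline built in this article:

```hcl
// Minimal illustrative job: one group, one task, run once as a batch workload.
job "example" {
  datacenters = ["dc1"]
  type        = "batch"

  group "example-group" {
    count = 1 // Nomad creates one allocation of this group

    task "example-task" {
      driver = "docker" // the task driver (execution environment)

      config {
        image   = "alpine:3.14"
        command = "echo"
        args    = ["hello from a Nomad task"]
      }
    }
  }
}
```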
The Nomad job specification reference is helpful for those unfamiliar with the stanza definitions. In this example, we can see that the sales-close-prediction job is a batch type and consists of a group (trainAndSaveModel) with two tasks: train and storeModel_GCP.
The file we used to define the Nomad job specification is as follows:
```hcl
job "sales-close-prediction" {
  datacenters = ["dc1"]
  type        = "batch"

  // Send payload parameters that are passed to top level job
  // Could also use a file here
  parameterized {
    meta_required = [
      "save_path",
      "data_path",
      "prediction_path",
      "data_url",
      "gcp_creds",
      "gcp_bucket_path",
      "model_version"
    ]
    meta_optional = ["batch_size", "epochs"]
  }

  group "trainAndSaveModel" {
    count = 1

    volume "models" {
      type   = "host"
      source = "models"
    }

    task "train" {
      driver = "docker"

      // Expose the versioned model data via expected directory
      // See dockerfile for specifications
      volume_mount {
        volume      = "models"
        destination = "/usr/src/app/models"
      }

      resources {
        cpu    = 3054
        memory = 1024
      }

      lifecycle {
        hook    = "prestart"
        sidecar = false
      }

      env {
        VERSION         = "${NOMAD_META_model_version}_${NOMAD_ALLOC_ID}"
        SAVE_PATH       = "${NOMAD_META_save_path}"
        DATA_PATH       = "${NOMAD_META_data_path}"
        DATA_URL        = "${NOMAD_META_data_url}"
        PREDICTION_PATH = "${NOMAD_META_prediction_path}"
      }

      config {
        image = "joshuanjordan/tensornode:0.7.2"
      }
    }

    // Storage related activity
    task "storeModel_GCP" {
      leader = true
      driver = "exec"

      env {
        TARGET_DIR = "${NOMAD_META_save_path}_v${NOMAD_META_model_version}_${NOMAD_ALLOC_ID}"
      }

      volume_mount {
        volume      = "models"
        destination = "/tmp"
      }

      config {
        command = "/bin/bash"
        // Store the versioned model
        args = [
          "-c",
          "gsutil cp /tmp/${env["TARGET_DIR"]}/model.json gs://${env["NOMAD_META_gcp_bucket_path"]}/${env["TARGET_DIR"]}.json"
        ]
      }
    }
  }
}
```
train runs with the Docker driver and storeModel_GCP runs with the exec driver. As a prerequisite, Docker was installed on the Nomad clients during the provisioning process.
Because all GCP VMs come with the Google Cloud SDK (and therefore gsutil) baked in, there is no reason to pre-install it. However, if some other binary is needed with the exec driver, that binary needs to be on the Nomad client beforehand. The artifact stanza can be used to pull in dependencies from external sources if desired.
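For example, a hypothetical artifact stanza on an exec task might look like the sketch below; the URL and binary name are placeholders, not real dependencies of this pipeline:

```hcl
task "needs-a-binary" {
  driver = "exec"

  // Download and unpack a dependency onto the client before the task starts
  artifact {
    source      = "https://example.com/tools/some-cli.tar.gz"
    destination = "local/bin"
  }

  config {
    command = "local/bin/some-cli"
  }
}
```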
Under normal circumstances, Nomad guarantees that tasks grouped together will run on the same VM when the job is deployed, and the tasks nested within the group will run in parallel post-deployment. Because we need the grouped tasks to run in sequence, I’ve added lifecycle hooks to enforce that ordering.
Tasks and Lifecycle Hooks
For the train task, I’ve used the prestart hook without a sidecar. This stanza tells Nomad that the train task should run before the main task starts. The sidecar = false setting indicates that this task will not restart once finished. The main task, storeModel_GCP, is identifiable to Nomad because it does not have an associated lifecycle hook.
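For reference, these are the relevant pieces of the job above; only train carries a lifecycle stanza:

```hcl
task "train" {
  driver = "docker"

  lifecycle {
    hook    = "prestart" // run before the main task
    sidecar = false      // do not restart once it exits
  }
  // ...
}

task "storeModel_GCP" {
  leader = true
  driver = "exec" // no lifecycle stanza, so this is the main task
  // ...
}
```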
Moreover, I’m using shared volumes and volume mounts, specified at the group and task levels respectively. The group-level volume stanza points to a pre-existing directory path on the Nomad client where files can be shared across tasks. You can do similar things with Nomad’s built-in allocation task directory, but I wanted to use Docker because it’s a popular tool.
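For completeness, a host volume like the models volume above has to be declared in the Nomad client configuration before the job can reference it. A minimal sketch, assuming a path of /opt/nomad/volumes/models (the actual path depends on how you provisioned the clients):

```hcl
client {
  enabled = true

  // The name here must match the "source" of the group-level volume stanza
  host_volume "models" {
    path      = "/opt/nomad/volumes/models"
    read_only = false
  }
}
```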
Leveraging a Parameterized Job
The last thing to note about the Nomad configuration is the parameterized stanza. This tells Nomad that you do not want to execute the job immediately when you run it but, rather, invoke it with arguments at some other time, like a function. In Nomad nomenclature, this invocation is known as dispatching; the arguments are passed as metadata. The metadata is interpreted at runtime and, therefore, you can reference it in the job file via interpolation.
Because I want the ability to run more than one of these jobs in parallel (i.e., training and storing 1,000 versions of a model), I’m using the $NOMAD_ALLOC_ID parameter as a unique identifier. If you change count = 1 to count = 2 in the group stanza, Nomad generates two allocation IDs, each referenced via the $NOMAD_ALLOC_ID variable and interpolated across any of the tasks. Without going too deep into the details of the model code, know that the save path for the generated models is models/$VERSION_$NOMAD_ALLOC_ID/model.json.
The reason for doing this is that you don’t want 1,000 executions of this job to store only one model version. Without a unique identifier in the path, every execution would write to the same location, you could not store multiple versions of the model, and you could not meet the primary objective of our project.
Steps for Dispatching a Nomad Job
Now that we’ve covered the important options of our Nomad configuration file, we can perform the steps needed to run the job I’ve defined. We already have an environment spun up and can use the UI (Figure 2) or the CLI (Figure 3) to verify all is well.

Figure 2: UI of Nomad

Figure 3: CLI verification of Nomad status
Now that we know Nomad is up and running, we can plan the job.
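If you would rather stay in the terminal, the same checks and the plan can be done from the CLI. This assumes the job spec above is saved locally as sales-close-prediction.nomad:

```bash
# Confirm the servers and clients are healthy
nomad server members
nomad node status

# Dry-run the job to see what the scheduler would do
nomad job plan sales-close-prediction.nomad
```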

Figure 4: plan salesClosePrediction job and associated output
It’s not a requirement to run the job with the version verification (the -check-index flag) that the plan output suggests, but it’s a reasonable copy/paste step to follow along with.
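Assuming the same file name as above and a brand-new job (whose check-index is 0), the run command suggested by the plan output looks like this:

```bash
# Register the job, failing if it changed since the plan was generated
nomad job run -check-index 0 sales-close-prediction.nomad
```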

Figure 5: nomad run (index)
As mentioned earlier, the job is parameterized, so it doesn’t execute immediately. When you run a parameterized job, you expect to pass arguments to the job when you dispatch it. I’ve abstracted the dispatch into a shell script here:
```bash
#!/usr/bin/env bash

VERSION=$1
CREDS=$(cat /path/to/gcp/creds)

nomad job dispatch \
  -meta save_path=trainedWinPrediction \
  -meta data_path=./opportunityHistory.json \
  -meta data_url=<gcp cloud function url> \
  -meta prediction_path=./predictions.txt \
  -meta model_version="${VERSION}" \
  -meta gcp_creds="${CREDS}" \
  -meta gcp_bucket_path=<gcp-bucket-name> \
  sales-close-prediction
```
This crude script takes a single argument: $VERSION. We do this because we may want to tag our dispatches with the version we are training. This is true even if we run 1,000 parallel allocations; we still want to reference the version we ran the batches against later on.
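A hypothetical invocation, assuming the script above is saved as dispatch.sh and made executable:

```bash
# Dispatch a run that trains and stores version 1 of the model
./dispatch.sh 1
```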
Next, we can dispatch the job (Figure 6). Although we could monitor the allocations and the status of the job from the CLI, the Nomad UI (Figure 7) is a cleaner panel for viewing the job’s progress.
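The CLI equivalent, if you prefer it, is a couple of status commands; the allocation ID comes from the job status output of your own run:

```bash
# Watch the dispatched job and its allocations
nomad job status sales-close-prediction

# Drill into a single allocation (ID taken from the output above)
nomad alloc status <allocation-id>
```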

Figure 6: the dispatch of the job

Figure 7: the sales-close-prediction job running in the Nomad UI
Earlier we discussed how the $NOMAD_ALLOC_ID variable is used to identify separate runs of the same model-version code. We can see this in action in Figure 8: both allocations run the same job group because they are using the same job file. We specified this behavior by changing the count = 1 parameter in the Nomad configuration to count = 2 before we planned, ran and dispatched the job.

Figure 8: two allocations of the same job
By clicking on either of the allocations, we can peer into the lifecycle status (Figure 9), as well as the resource utilization for the specified allocation. Remember that a task with a prestart hook runs before the main task; this is indicated by the green shading. Once the train task finishes, the storeModel_GCP task copies the model file, the output of train, into the GCP bucket specified in the dispatch script (Figure 10). As for compute utilization, we can view the memory footprint as the job runs and make adjustments before subsequent runs.
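The same resource figures are available from the CLI if you want to script the check:

```bash
# Show CPU and memory statistics for a specific allocation
nomad alloc status -stats <allocation-id>
```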

Figure 9: the task lifecycle (prestart) & main

Figure 10: the models deployed into Google Cloud — same version, different allocation ids
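If you prefer to verify the upload outside the Cloud console, the bucket contents can be listed from any machine with gsutil configured; the placeholder matches the gcp_bucket_path metadata passed in the dispatch script:

```bash
# List the stored, versioned models
gsutil ls gs://<gcp-bucket-name>/
```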
Moving Forward
At this point, we have completed our goal. The project isn’t a production-ready end-to-end pipeline, but it does illustrate why Nomad is a player in the MLOps process. Nomad’s target area is pipelines, where simple orchestration is valuable and the flexibility to operate the same way across multiple environments is needed. Nomad’s native federation capabilities make it a strong choice of scheduler for large, complex environments: you get full visibility into batch processing across multiple clusters, cloud regions or data centers, and you can deploy and update tasks across all environments via a single job file.
For an emerging field like MLOps, where the underlying work is already difficult and tough to coordinate, using tools that don’t reduce the complexity of operations doesn’t make sense. The tooling and workflows that organizations adopt must support whatever best practices look like in the future. It’s too early to say exactly where Nomad fits in the completed puzzle of machine learning operations, but it is clear that its role is valuable if simplicity, flexibility and happier data scientists are your goals.
Some readers familiar with HashiCorp may wonder why I didn’t mention community plugins. I didn’t go this direction because I wanted to demonstrate Nomad’s capability without the need for any plugins. Any reader interested in enhancing this workflow with a plugin or toolchain is encouraged to do so.