Tutorial: Edge AI with Triton Inference Server, Kubernetes, and Jetson Mate

In this tutorial, we will configure and deploy Nvidia Triton Inference Server on the Jetson Mate carrier board to run inference on computer vision models. It builds on our previous post, which introduced the Jetson Mate from Seeed Studio as a platform for running a Kubernetes cluster at the edge.
Though this tutorial focuses on Jetson Mate, you can use one or more Jetson Nano Developer Kits connected to a network switch to run the Kubernetes cluster.
Step 1: Install K3s on Jetson Nano System-on-Modules (SoMs)
Assuming you have installed and configured JetPack 4.6.x on all four Jetson Nano 4GB modules, let’s start with the installation of K3s.
The first step is to make the Nvidia Container Toolkit the default runtime for Docker. To do this, add the line "default-runtime": "nvidia" to the file /etc/docker/daemon.json on each node. This is important because we want K3s to be able to access the GPU available on each Jetson Nano.
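JetPack already ships /etc/docker/daemon.json with the Nvidia runtime registered, so only the default-runtime key needs to be added. As a rough reference, the edited file typically looks like this:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Restart Docker with sudo systemctl restart docker for the change to take effect.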
SSH into the module inserted into the master slot and run the below command to install the K3s control plane.
mkdir -p ~/.kube
curl -sfL https://get.k3s.io | \
  K3S_TOKEN=jetsonmate \
  K3S_KUBECONFIG_MODE="644" \
  INSTALL_K3S_EXEC="--docker --disable servicelb --disable traefik" \
  K3S_KUBECONFIG_OUTPUT="$HOME/.kube/config" \
  sh -
The --docker switch forces K3s to use the Docker runtime, which we just configured with the Nvidia container runtime as its default, instead of the bundled containerd runtime. We also disable the ServiceLB load balancer and the Traefik ingress controller on the cluster to avoid network-related issues.
Once the control plane is installed, verify that the K3s service is up and running.
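One way to check is with systemd on the master node:

# The k3s service should be reported as active (running)
sudo systemctl status k3s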
Run the below command on the remaining three nodes to join them to the cluster:
curl -sfL https://get.k3s.io | \
  K3S_TOKEN=jetsonmate \
  K3S_URL="https://jm-node-1:6443" \
  INSTALL_K3S_EXEC="--docker" \
  sh -
By the end of this step, we should have a four-node Kubernetes cluster based on K3s.
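To confirm, list the nodes from the master module:

# All four Jetson Nano modules should be listed in the Ready state
kubectl get nodes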
Step 2: Configure and Mount NFS Share in the Cluster
We are using NFS as the shared backend to store the models accessed by Triton Inference Server. The NFS subdir external provisioner Helm chart exposes an existing NFS share to Kubernetes through a storage class.
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=10.0.0.30 \
  --set nfs.path=/var/nfs/general \
  --set storageClass.provisionerName=k8s-sigs.io/nfs-subdir-external-provisioner
Replace the NFS server IP address and export path with those of your own server.
This Helm chart creates a storage class that supports the dynamic provisioning of shared PVCs.
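You can verify it with kubectl; the chart names the storage class nfs-client by default, which is the name we reference in the next step:

# The nfs-client storage class should appear in the list
kubectl get storageclass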
Step 3: Create a Shared PVC and Add a TensorFlow Model
With the NFS share ready, let’s create the shared PVC that will hold the models accessed by Triton Inference Server.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: models-pvc
spec:
  storageClassName: nfs-client
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
Notice the storageClassName and accessModes values in the specification. They allow us to create a shared PVC that is accessible to multiple Triton pods.
kubectl create -f models-pvc.yaml
When you access the directory exposed as an NFS share, you will see a new directory whose name includes the namespace and the name of the PVC.
Let’s populate this directory with the Inception model, label file, and the configuration file expected by Triton Inference Server.
On the NFS server, run the following commands to download the model and move it to the NFS share.
wget -O /tmp/inception_v3_2016_08_28_frozen.pb.tar.gz \
  https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz
cd /tmp
tar xzf inception_v3_2016_08_28_frozen.pb.tar.gz
mkdir -p /var/nfs/general/default-models-pvc-pvc-d330a39d-0b96-42ac-8839-863800c2b924/inception_graphdef/1
mv inception_v3_2016_08_28_frozen.pb \
  /var/nfs/general/default-models-pvc-pvc-d330a39d-0b96-42ac-8839-863800c2b924/inception_graphdef/1/model.graphdef
On my file server, the directory /var/nfs/general/ is exported as an NFS share. Replace that with your own path.
Next, let’s download the labels and the configuration file from the Triton GitHub repository.
wget -O /var/nfs/general/default-models-pvc-pvc-d330a39d-0b96-42ac-8839-863800c2b924/inception_graphdef/config.pbtxt \
  https://raw.githubusercontent.com/triton-inference-server/server/main/docs/examples/model_repository/inception_graphdef/config.pbtxt
wget -O /var/nfs/general/default-models-pvc-pvc-d330a39d-0b96-42ac-8839-863800c2b924/inception_graphdef/inception_labels.txt \
  https://raw.githubusercontent.com/triton-inference-server/server/main/docs/examples/model_repository/inception_graphdef/inception_labels.txt
This directory structure enables Triton to access and load models from the shared backend.
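For reference, the model repository on the NFS share should now look roughly like this (the PVC directory name is shortened here for readability):

<pvc-directory>/
└── inception_graphdef/
    ├── config.pbtxt
    ├── inception_labels.txt
    └── 1/
        └── model.graphdef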
Step 4: Building the Triton Inference Server Docker Image
As of April 2022, Nvidia doesn’t have an official container image for Triton targeting Jetson devices. To deploy it on Kubernetes, we will have to build our own Docker image.
FROM nvcr.io/nvidia/l4t-base:r32.6.1

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libb64-dev \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        libopenblas-dev \
        rapidjson-dev \
        patchelf \
        wget \
        zlib1g-dev && \
    apt-get autoclean && \
    apt-get autoremove

RUN mkdir -p /opt/triton && \
    wget https://github.com/triton-inference-server/server/releases/download/v2.17.0/tritonserver2.17.0-jetpack4.6.tgz && \
    tar xf tritonserver2.17.0-jetpack4.6.tgz -C /opt/triton && \
    rm tritonserver2.17.0-jetpack4.6.tgz

ENV PATH="/opt/triton/bin:$PATH"
ENV LD_LIBRARY_PATH="/opt/triton/lib:$LD_LIBRARY_PATH"

ENTRYPOINT ["tritonserver", "--backend-directory=/opt/triton/backends"]
This Dockerfile builds the image from the nvcr.io/nvidia/l4t-base:r32.6.1 base image, which corresponds to JetPack 4.6, and unpacks the Triton 2.17.0 release built for JetPack into /opt/triton.
Build, tag, and push this image to your favorite repository.
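For example, with Docker Hub (substitute your own repository name; the tag is arbitrary). Since the base image is arm64, build it on one of the Jetson modules or with an arm64-capable builder:

# Build the Triton image for Jetson and push it to your registry
docker build -t <your-repo>/triton-jetson:latest .
docker push <your-repo>/triton-jetson:latest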
Step 5: Deploying Triton Inference Server on K3s
Triton Inference Server takes advantage of the GPU available on each Jetson Nano module, but only one instance of Triton can use the GPU at a time. To ensure that exactly one Triton pod runs on each node, we will configure it as a DaemonSet.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: triton
  labels:
    app: triton
spec:
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
        - name: triton-jetson
          image: janakiramm/triton-jetson
          args:
            - "--min-supported-compute-capability=5.3"
            - "--model-repository=/opt/triton/models"
            - "--backend-config=tensorflow,version=2"
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            - mountPath: /opt/triton/models
              name: models
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: models
          persistentVolumeClaim:
            claimName: models-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: triton
spec:
  type: NodePort
  selector:
    app: triton
  ports:
    - protocol: TCP
      name: http
      port: 8000
      nodePort: 30800
      targetPort: 8000
    - protocol: TCP
      name: grpc
      port: 8001
      nodePort: 30801
      targetPort: 8001
    - protocol: TCP
      name: metrics
      port: 8002
      nodePort: 30802
      targetPort: 8002
Notice that the shared PVC, models-pvc, is mounted at /opt/triton/models so Triton can access the model artifacts.
Deploy the DaemonSet and check that the pods are running and the service is exposed.
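Assuming the manifest above is saved as triton.yaml (the file name is just an example), the following commands deploy and verify it:

# Deploy the DaemonSet and the NodePort service
kubectl apply -f triton.yaml
# One Triton pod should be scheduled on each of the four nodes
kubectl get pods -l app=triton -o wide
# The service maps NodePorts 30800 (HTTP), 30801 (gRPC), and 30802 (metrics)
kubectl get svc triton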
The endpoints exposed by Triton pods are used by the clients for inference.
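As a quick sanity check, you can hit Triton’s HTTP API from any machine on the network; replace the IP address with that of one of your cluster nodes:

# A 200 response indicates the server is up and its models are ready
curl -i http://10.0.0.30:30800/v2/health/ready
# Returns the metadata of the Inception model loaded from the NFS share
curl http://10.0.0.30:30800/v2/models/inception_graphdef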
Step 6: Performing Inference from Python
Download the Python image client example code from the Triton GitHub repository. To install the required modules, create a requirements.txt file with the below content and use pip3 to install them.
pillow
numpy
attrdict
tritonclient
google-api-python-client
grpcio
geventhttpclient
boto3
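Then install the dependencies:

pip3 install -r requirements.txt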
Download a picture of a car and run the image client, passing the picture as a parameter.
python3 image_client.py -u 10.0.0.30:30800 -m inception_graphdef -s INCEPTION -x 1 -c 1 car.jpg
Replace the IP address with that of any node in your cluster; 30800 is the NodePort mapped to Triton’s HTTP endpoint. The client prints the model’s top prediction for the image.
This concludes the end-to-end tutorial on installing and configuring Triton Inference Server on a Jetson Nano cluster running K3s.