Kubeflow Training Operators and Istio: solving the proxy sidecar lifecycle problem for AI/ML workloads

Posted on October 4, 2021

With Kubeflow gaining traction in the community and seeing early adoption in enterprises, security and observability concerns are becoming increasingly important. Many organizations that run AI/ML workloads operate with sensitive personal or financial data and have strict requirements for data encryption, traceability, and access control. Quite often, we see the Istio service mesh used to solve these problems and to gain other benefits of the rich functionality it provides.

Kubeflow relies on Istio for traffic routing, authorization policies, and user access control. However, at the time of writing, it does not fully support Istio for the workloads running on top of it. This post covers architectural and design issues specific to running Kubeflow workloads on Istio and focuses on the specific problems of AI/ML training jobs: TFJob, PyTorchJob, and the like. Finally, the post presents a reference implementation of the Istio Aux Controller - an auxiliary Kubernetes Operator that helps to solve these problems in a fully automated manner.

Istio

High-level architecture

It is important to have a basic understanding of how Istio is designed at a high level.

The official documentation provides an in-depth overview of all components, but for the purposes of this post we will focus mostly on the data plane.

Istio Architecture

Image source: Istio architecture documentation

The Istio Control Plane injects Envoy proxies as sidecar containers running alongside the payload containers in the same pod. Once the proxy is up and running, it manages all network communication between pods in the mesh and receives configuration updates from the Control Plane. All access policies and traffic routes are configured via the Control Plane and then enforced by the proxies.

To enable sidecar injection at the namespace level, the namespace should have the istio-injection: enabled label.
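For example, to enable injection for the default namespace (the same command is used in the demo later in this post):

kubectl label namespace default istio-injection=enabled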

Sidecar injection

Let’s take a deeper look into the timeline of the events when Istio injection is enabled and a new Pod is being created:

  1. The Istio CNI plugin configures Pod’s iptables to route all traffic to the Proxy.
  2. If there are any initContainers specified, they start and must complete prior to starting the payload and sidecar containers.
  3. Payload and sidecar containers start.

The network availability issue

While the injection model looks straightforward, there’s one major design flaw here - the Pod network is unreachable until the proxy sidecar starts. Let’s revisit the timeline from this perspective:

  1. Istio CNI plugin configures routing of all traffic to a non-existent proxy (network becomes unavailable)
  2. initContainers run
  3. Payload and sidecar containers start
  4. Proxy starts (network is available again)

This means that any payload container or initContainer that requires network access is sensitive to this issue:

  • When a payload container requires network connectivity on start, it will crash-loop until the sidecar proxy has started.
  • If any of the initContainers depends on fetching data over the network (and fails otherwise), this introduces a deadlock: none of the payload or sidecar containers can start until all the initContainers complete.

The initContainers deadlock issue is beyond the scope of this post as it doesn’t affect the Kubeflow training jobs.

The Job completion issue

Apart from the racy network availability during Pod startup, there's another issue with Kubernetes Job-like resources and their handling of the sidecar. Depending on the type of Kubernetes Controller or Operator managing the created resource, the Istio Proxy keeps running after the payload container completes and prevents the Job (and Job-like) resources from completing.

Training Operators on Istio

When running distributed training jobs using the TensorFlow, PyTorch, or MXNet Operators, it is pretty standard for the training code to access the dataset at a remote location over the network (e.g. from cloud storage). This makes it sensitive to the network availability issue and can lead to sporadic failures when running on Istio. TensorFlow will be used for illustration purposes here; however, the problem surface and the approaches to solving it are equally applicable to the PyTorch, MXNet, and other Training Operators.

Let’s consider this naive MNIST classification code as an example workload. Note that the mnist.load_data() call downloads the sample dataset from a remote location and requires the network to be available.

import tensorflow as tf

# import MNIST dataset
mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# define and compile the model
model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10),
    ]
)

model.compile(
  optimizer="adam",
  loss="sparse_categorical_crossentropy",
  metrics=["accuracy"]
)

# train the model
model.fit(x_train, y_train, epochs=5)

To run this code on a Kubernetes cluster, it needs to be saved into a file (for example, mnist.py) and packaged into a Docker image so that it can be pulled onto any of the cluster nodes by the training operator workers. We will use a pre-built Docker image that already includes the code from the above snippet: datastrophic/tensorflow:2.6.0-mnist.
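For reference, a Dockerfile for such an image could look roughly like the sketch below (the base image and file layout are illustrative assumptions, not necessarily how datastrophic/tensorflow:2.6.0-mnist was actually built):

# a minimal sketch: start from an upstream TensorFlow image and add the training script
FROM tensorflow/tensorflow:2.6.0
WORKDIR /app
COPY mnist.py .

With the image available in a registry, let’s create the following TFJob: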

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: datastrophic/tensorflow:2.6.0-mnist
            command: ['python', '-u', 'mnist.py']

It can take some time to pull the image, but once it is pulled and launched we can check the logs to see that the job was unable to download the dataset. For that, let's look at one of the worker pods' logs:

$> kubectl logs mnist-worker-0 -c tensorflow

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
Traceback (most recent call last):

... <part of the log omitted for better readability>

Exception: URL fetch failure on https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz: None -- [Errno 101] Network is unreachable

Although after a couple of attempts the Job will be able to start and pull the data - in a situation when the Istio Proxy becomes ready before the payload container attempts to access the network - the Job won't be able to complete because a single container is still running. And this single container is the Istio Proxy, which is unaware of the lifecycle of the other containers in the pod. We can see the event timeline here:

$> kubectl get pod mnist-worker-0 -w
NAME             READY   STATUS            RESTARTS   AGE
mnist-worker-0   0/2     Init:0/1          0          1s
mnist-worker-0   0/2     PodInitializing   0          6s
mnist-worker-0   1/2     Running           0          9s
mnist-worker-0   2/2     Running           0          16s
mnist-worker-0   1/2     NotReady          0          2m21s

Let’s now take a look at the prior art and the possible workarounds discussed in the community.

Prior art

There have been quite a few discussions, threads, and blog posts about how these issues can be resolved or whether there's any workaround for them. What follows is a quick overview of the most frequently mentioned approaches.

The networking issue

One of the most common solutions for this problem is to modify the container command to wait for the sidecar proxy to become available, as recommended, for example, in istio/issues#11130. The modified command for the TFJob can look as follows:

command: ['bash', '-c']
args: ['until curl --silent --head --fail localhost:15000 > /dev/null; do sleep 1; done; python -u mnist.py']

The entrypoint probes the Envoy proxy admin port 15000 until it becomes available and only then executes the training code.

Yet another intuitive solution, when network access to the remote data is unstable, is to introduce retries in the source code responsible for retrieving it. For example (the import below assumes the retrying package, since the original snippet does not name a library):

from retrying import retry  # assumption: the `retrying` package, whose @retry(wait_fixed=...) waits in milliseconds

@retry(wait_fixed=1000)
def load_dataset():
    mnist = tf.keras.datasets.mnist
    return mnist.load_data()

This looks more like a bandaid for the given example but, in general, retries can improve resilience and help to avoid transient failures in the presence of unreliable data sources.

The sidecar termination issue

One of the available approaches is similar to the Envoy probing one: it proposes to change the entrypoint and terminate the Istio Proxy either via pkill or by calling a dedicated endpoint, http://127.0.0.1:15020/quitquitquit. Based on this GitHub comment, the final entrypoint command for the example MNIST TFJob would look like this:

command: ["/bin/bash", "-c"]
args:
  - |
    trap "curl --max-time 2 -s -f -XPOST http://127.0.0.1:15000/quitquitquit" EXIT
    while ! curl -s -f http://127.0.0.1:15020/healthz/ready; do sleep 1; done
    python -u mnist.py    

An important note on using pkill instead of /quitquitquit: pkill would require a shared process namespace between the containers in the pod, which has its own security implications.
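For context, a shared process namespace is enabled with a single field on the Pod spec; a minimal sketch (not part of the referenced workaround) could look like this:

apiVersion: v1
kind: Pod
metadata:
  name: shared-pid-example
spec:
  # makes processes of all containers in the Pod visible (and signalable) to each other,
  # which is what a pkill-based termination of the istio-proxy would rely on
  shareProcessNamespace: true
  containers:
  - name: tensorflow
    image: datastrophic/tensorflow:2.6.0-mnist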

Another approach, described in Handling Istio Sidecars in Kubernetes Jobs, proposes a helper process that wraps the entrypoint and communicates with Envoy, waiting for it to start and terminating it after the wrapped application stops.

The Good, Bad, and Ugly: Istio for Short-lived Pods proposes to inject a wrapper binary and overwrite the entrypoint command via a webhook, and then trigger the binary's subcommand from an accompanying controller to terminate the proxy (similar to kubectl exec).

Conclusion

All the approaches described above have pros and cons, but the main drawback is that the initial workloads cannot be moved to Istio without modifying either the manifests, the entrypoints, or the source code (in the case of retries). At any reasonable scale, the number of changes would be significant enough to abandon an initiative like this one. The automated mutation of the entrypoint looks the closest to a proper solution; however, it proposes to inject an init container with a wrapper binary and mutate the entrypoint, which is not always feasible as there could be issues related to container ordering and multi-container pods.

Meet Istio AUX Controller

Overview

All the workarounds and the lack of an out-of-the-box solution led me to prototype a simple MutatingAdmissionWebhook and a Pod Controller aimed at solving the above issues with the following principles in mind:

  • The existing user code, including Kubernetes manifests, should not have to change to work on Istio.
  • Full automation. Once the solution is in place, it can be enabled or disabled per namespace by a user.
  • Narrow scope and low impact: no changes to global settings should be required.
  • Container entrypoints must not be mutated. The majority of the workarounds deal with single-container Pods in Jobs, but there might be other containers that depend on the network.

The good news is that in version 1.7, Istio introduced a global configuration property, values.global.proxy.holdApplicationUntilProxyStarts, that injects the sidecar container at the beginning of a Pod's container list and causes the other containers to wait until it starts. This is described in great detail in a blog post by Marko Lukša: Delaying application start until sidecar is ready.

Istio AUX contains a MutatingAdmissionWebhook that mutates Pods submitted to namespaces with specific labels and adds an Istio-specific annotation to them:

proxy.istio.io/config: "holdApplicationUntilProxyStarts: true"

That way, the Istio Operator will take care of rearranging the sidecars and delaying the start of the first non-Istio container until the proxy is ready. This could also be solved by setting the same Istio Proxy property globally; however, it is false by default, and it is not clear whether this setting could impact other existing deployments outside Kubeflow.
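For comparison, a rough sketch of what the global, mesh-wide setting could look like in an IstioOperator resource:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      proxy:
        # delay all application containers until the sidecar proxy is ready, mesh-wide
        holdApplicationUntilProxyStarts: true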

Another part of the Istio AUX Controller is the Controller itself, which is also scoped to namespaces with specific labels and subscribed to Pod update events. All container status changes trigger the reconciliation, and the controller keeps checking which containers are still running in the Pod. Once there is only one left and it is the Istio Proxy, the Controller execs into the pod and runs curl -sf -XPOST http://127.0.0.1:15020/quitquitquit inside it. The Istio Proxy container image has curl pre-installed, so there's no need for an additional binary or a sidecar to terminate the proxy.
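The manual equivalent of what the controller does would look roughly like this (assuming the mnist-worker-0 pod from the earlier example and the default istio-proxy container name):

kubectl exec mnist-worker-0 -c istio-proxy -- curl -sf -XPOST http://127.0.0.1:15020/quitquitquit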

The termination heuristic is pretty naive, but it is easy to extend it to a more sophisticated version, e.g. checking against a list of container names that have to exit before the Proxy is terminated.

Istio AUX Controller is a reference implementation for the above approach and is available on GitHub at datastrophic/istio-aux.

Demo

Prerequisites

You should have a Kubernetes cluster available; kind will suffice, but ensure the Docker daemon has sufficient resources to accommodate cert-manager, Istio, and the Kubeflow Training Operator, and to run a two-pod TFJob (8 CPU and 8GB RAM should be sufficient). Based on the commands used below, the following software is required:

  • kind
  • kubectl
  • istioctl

Cluster setup

The cluster setup is pretty straightforward. The only highlight here is that we will use the composite Training Operator that supports all types of training jobs (formerly the TF Operator).

kind create cluster

# wait for node(s) to become ready
kubectl wait --for condition=Ready node --all

# install cert-manager
kubectl create -f https://github.com/jetstack/cert-manager/releases/download/v1.5.3/cert-manager.yaml

# wait for pods to become ready
kubectl wait --for=condition=Ready pods --all --namespace cert-manager

# install istio
istioctl install --set profile=demo -y

# install the training operator
kubectl apply -k "github.com/kubeflow/tf-operator.git/manifests/overlays/standalone?ref=master"

# wait for pods to become ready
kubectl wait --for=condition=Ready pods --all --namespace kubeflow


# install the Istio AUX controller
kubectl apply -k "github.com/datastrophic/istio-aux.git/config/default?ref=master"

Deploying the workloads

Let’s create a TFJob that will be used for testing, enable Istio injection for the default namespace, and submit the job:

kubectl label namespace default istio-injection=enabled

cat <<EOF >./tfjob.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: datastrophic/tensorflow:2.6.0-mnist
            command: ['python', '-u', 'mnist.py']
EOF

kubectl create -f tfjob.yaml

kubectl get pods -w

We’ll see that the Pods will eventually get stuck in the NotReady state with one container still running.

Now let’s enable the Istio AUX Controller for the default namespace and redeploy the TFJob one more time.

kubectl delete -f tfjob.yaml

kubectl label namespace default io.datastrophic/istio-aux=enabled

kubectl create -f tfjob.yaml

kubectl get pods -w

This time, all the pods reach the Completed state.
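To double-check, the status of the TFJob resource itself can also be inspected, for example:

kubectl get tfjobs mnist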

In the meantime, the Istio AUX Controller logs contain output like this:

...
INFO	webhook.webhook	processing pod mnist-worker-0
INFO	webhook.webhook	pod mnist-worker-0 processed
...
INFO	webhook.webhook	processing pod mnist-worker-1
INFO	webhook.webhook	pod mnist-worker-1 processed
...
INFO	istio-aux	found a pod with istio proxy, checking container statuses	{"pod": "mnist-worker-0"}
INFO	istio-aux	some containers are still running, skipping istio proxy shutdown	{"pod": "mnist-worker-0", "containers": ["tensorflow"]}
...
INFO	istio-aux	found a pod with istio proxy, checking container statuses	{"pod": "mnist-worker-1"}
INFO	istio-aux	some containers are still running, skipping istio proxy shutdown	{"pod": "mnist-worker-1", "containers": ["tensorflow"]}
...
INFO	istio-aux	found a pod with istio proxy, checking container statuses	{"pod": "mnist-worker-0"}
INFO	istio-aux	the payload containers are terminated, proceeding with the proxy shutdown	{"pod": "mnist-worker-0"}
...
INFO	istio-aux	found a pod with istio proxy, checking container statuses	{"pod": "mnist-worker-1"}
INFO	istio-aux	the payload containers are terminated, proceeding with the proxy shutdown	{"pod": "mnist-worker-1"}
...
INFO	istio-aux	found a pod with istio proxy, checking container statuses	{"pod": "mnist-worker-0"}
INFO	istio-aux	istio-proxy is already in a terminated state	{"pod": "mnist-worker-0"}
...
INFO	istio-aux	found a pod with istio proxy, checking container statuses	{"pod": "mnist-worker-1"}
INFO	istio-aux	istio-proxy is already in a terminated state	{"pod": "mnist-worker-1"}
...

Final thoughts

The proposed solution works for existing versions of Kubernetes and Istio but, given the fast pace of their evolution, might become outdated relatively quickly. It would be nice to have similar functionality in either of them, but it is understandable that container interdependencies in Pods do not generalize well into a universal solution.

Ideally, it would be great to have this problem solved by Kubernetes itself. As described in Sidecar container lifecycle changes in Kubernetes 1.18, it was proposed to assign containers a lifecycle type so that the sidecars would be terminated by the Kubelet once the payload containers complete.

Although the reference implementation addresses a specific case and a subset of Kubeflow Operators, it provides a relatively generic solution to the problem but, of course, requires additional work to productionize it.

Please don’t hesitate to reach out to me with feedback and/or if you are interested in collaboration.