With Kubeflow gaining traction in the community and seeing early adoption in enterprises, security and observability concerns are becoming more and more important. Many organizations running AI/ML workloads operate with sensitive personal or financial data and have strict requirements for data encryption, traceability, and access control. Quite often, the Istio service mesh is used to address these concerns and to gain the other benefits of the rich functionality it provides.
Kubeflow relies on Istio for traffic routing, authorization policies, and user access control. However, at the time of writing, Istio is not fully supported for the workloads running on top of Kubeflow. This post covers architectural and design issues specific to running Kubeflow workloads on Istio and focuses on the problems of the AI/ML training jobs: TFJob, PyTorchJob, and the like. In the end, the post presents a reference implementation of the Istio Aux Controller - an auxiliary Kubernetes Operator that helps solve these problems in a fully automated manner.
Istio
High-level architecture
It is important to have a basic understanding of how Istio is designed at a high level.
The official documentation provides an in-depth overview of all the components, but for the purposes of this post, we will focus mostly on the data plane.
Image source: Istio architecture documentation
The Istio Control Plane injects Envoy proxies as sidecar containers running alongside the payload containers in the same pod. Once the proxy is up and running, it starts managing all network communication between pods in the mesh and receives configuration updates from the Control Plane. All the access policies and traffic routes are configured via the Control Plane and then enforced by the proxies.
To enable sidecar injection at the namespace level, the namespace should have the istio-injection: enabled label.
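For example, the same command that is used in the demo section below enables injection for the default namespace:

kubectl label namespace default istio-injection=enabled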
Sidecar injection
Let’s take a deeper look into the timeline of events when Istio injection is enabled and a new Pod is being created:
- The Istio CNI plugin configures the Pod’s iptables to route all traffic to the proxy.
- If there are any initContainers specified, they start and must complete prior to starting the payload and sidecar containers.
- Payload and sidecar containers start.
The network availability issue
While the injection model looks straightforward, there’s one major design flaw here - the Pod network is unreachable until the proxy sidecar starts. Let’s revisit the timeline from this perspective:
- The Istio CNI plugin configures routing of all traffic to a non-existent proxy (the network becomes unavailable).
- initContainers run.
- Payload and sidecar containers start.
- The proxy starts (the network is available again).
This means that any payload container or initContainer that requires network access is sensitive to this issue:
- A payload container that requires network connectivity on start will crashloop until the sidecar proxy is started.
- An initContainer that depends on fetching data over the network (and fails otherwise) introduces a deadlock, because none of the payload or sidecar containers can start until all the initContainers complete.
The initContainers deadlock issue is beyond the scope of this post as it doesn’t affect the Kubeflow training jobs.
The Job completion issue
Apart from the racy network availability during Pod startup, there is another issue with Kubernetes Job-like resources and their handling of the sidecar. Whichever Kubernetes Controller or Operator manages the created resource, the problem is the same: the Istio Proxy keeps running after the payload containers complete and prevents the Job (and Job-like) resources from completing.
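A minimal sketch of the problem (illustrative only, not part of the original setup): a plain Kubernetes Job submitted to an Istio-injected namespace never completes, because the injected istio-proxy container keeps the Pod running after the payload exits.

apiVersion: batch/v1
kind: Job
metadata:
  name: sidecar-completion-demo
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: payload
        image: busybox
        # the payload exits almost immediately, but the injected istio-proxy keeps running
        command: ['sh', '-c', 'echo done']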
Training Operators on Istio
When running distributed training jobs using the TensorFlow, PyTorch, or MXNet Operators, it is pretty standard for the training code to access the dataset at a remote location over the network (e.g. from cloud storage). This makes it sensitive to the network availability issue and can lead to sporadic failures when running on Istio. TensorFlow will be used for illustration purposes here; however, the problem surface and the approaches to solving it are equally applicable to the PyTorch, MXNet, and other Training Operators.
Let’s consider this naive MNIST classification code as an example workload. Note, the mnist.load_data() call downloads the sample dataset from a remote location and requires the network to be available.
import tensorflow as tf
# import MNIST dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
# define and compile the model
model = tf.keras.models.Sequential(
[
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(128, activation="relu"),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10),
]
)
model.compile(
optimizer="adam",
loss="sparse_categorical_crossentropy",
metrics=["accuracy"]
)
# train the model
model.fit(x_train, y_train, epochs=5)
To run this code on a Kubernetes cluster, it needs to be saved into a file (for example, mnist.py) and packed into a Docker image so that it can be pulled and used on any of the cluster nodes by training operator workers. We will use a pre-built Docker image that already includes the code from the above snippet: datastrophic/tensorflow:2.6.0-mnist. Let’s create the following TFJob:
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
name: mnist
spec:
tfReplicaSpecs:
Worker:
replicas: 2
restartPolicy: OnFailure
template:
spec:
containers:
- name: tensorflow
image: datastrophic/tensorflow:2.6.0-mnist
command: ['python', '-u', 'mnist.py']
It can take some time to pull the image, but once it is pulled and launched, we can check the logs to see that it was unable to download the dataset. For that, let’s look into one of the worker pods’ logs:
$> kubectl logs mnist-worker-0 -c tensorflow
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
Traceback (most recent call last):
... <part of the log omitted for better readability>
Exception: URL fetch failure on https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz: None -- [Errno 101] Network is unreachable
Although after a couple of attempts the Job will be able to start and pull the data - in a situation when the Istio Proxy becomes ready before the payload container attempts to access the network - the Job won’t be able to complete, with a single container still running. And this single container is the Istio Proxy, which is unaware of the state of the other containers. We can see the event timeline here:
$> kubectl get pod mnist-worker-0 -w
NAME READY STATUS RESTARTS AGE
mnist-worker-0 0/2 Init:0/1 0 1s
mnist-worker-0 0/2 PodInitializing 0 6s
mnist-worker-0 1/2 Running 0 9s
mnist-worker-0 2/2 Running 0 16s
mnist-worker-0 1/2 NotReady 0 2m21s
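To confirm which container is still running at this point, the container statuses can be inspected directly, for example:

kubectl get pod mnist-worker-0 -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'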
Let’s now take a look into the prior art and the possible workarounds discussed in the community.
Prior art
There were quite a few discussions, threads, and blog posts about how these issues can be resolved or if there’s any workaround for them. What follows is a quick overview of the most frequently mentioned approaches.
The networking issue
One of the most common solutions for this problem is to modify the container command and wait for the sidecar proxy to become available, as recommended, for example, in istio/issues#11130. The modified command for the TFJob can look as follows:
command: ['bash', '-c']
args: ['until curl --silent --head --fail localhost:15000 > /dev/null; do sleep 1; done; python -u mnist.py']
The entrypoint probes the Envoy proxy port 15000 until it becomes available and executes the training code only after that.
Yet another intuitive solution, when the network access to the remote data is not stable, is to introduce retries into the source code responsible for its retrieval - for example, using the retrying package:
from retrying import retry
import tensorflow as tf

@retry(wait_fixed=1000)
def load_dataset():
    mnist = tf.keras.datasets.mnist
    return mnist.load_data()
This looks more like a bandaid for the given example but, in general, retries can improve resilience and help to avoid transient failures in the presence of unreliable data sources.
The sidecar termination issue
One of the available approaches is similar to the Envoy probing and proposes to change the entrypoint and terminate the Istio Proxy either via pkill or by calling a dedicated endpoint, http://127.0.0.1:15020/quitquitquit. Based on this GitHub comment, the final entrypoint command for the example MNIST TFJob would look like this:
command: ["/bin/bash", "-c"]
args:
- |
trap "curl --max-time 2 -s -f -XPOST http://127.0.0.1:15000/quitquitquit" EXIT
while ! curl -s -f http://127.0.0.1:15020/healthz/ready; do sleep 1; done
python -u mnist.py
An important note on using pkill instead of /quitquitquit is that pkill would require a shared process namespace between containers in the pod, which has its own security implications.
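For reference, a shared process namespace is enabled at the Pod spec level (an illustrative snippet, not part of the original manifests):

spec:
  # lets containers in the Pod see and signal each other's processes (required for pkill)
  shareProcessNamespace: true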
Another approach, described in Handling Istio Sidecars in Kubernetes Jobs, proposes a helper process that wraps the entrypoint and communicates with Envoy, waiting for it to start and terminating it after the wrapped application stops.
The Good, Bad, and Ugly: Istio for Short-lived Pods proposes to inject a wrapper binary, overwrite the entrypoint command via a webhook, and then trigger the binary’s subcommand from an accompanying controller to terminate the proxy (similar to kubectl exec).
Conclusion
All the approaches described above have their pros and cons, but the main drawback is that the original workloads cannot be moved to Istio without modifying either the manifests, the entrypoints, or the source code (in the case of retries). At any reasonable scale, the number of required changes would be significant enough to abandon an initiative like this one. The automated mutation of the entrypoint looks the closest to a proper solution; however, it relies on injecting an init container with a wrapper binary and mutating the entrypoint, which is not always feasible as there could be issues related to container ordering and multi-container pods.
Meet Istio AUX Controller
Overview
All the workarounds and the lack of an out-of-the-box solution led me to prototyping a simple MutatingAdmissionWebhook and a Pod Controller aimed at solving the above issues with the following principles in mind:
- The existing user code including Kubernetes manifests should not change to work on Istio.
- Full automation. Once the solution is in place, it can be enabled or disabled per namespace by a user.
- Narrow scope and low impact that doesn’t require changing any global settings.
- Container entrypoints must not be mutated. The majority of the workarounds deal with single-container Pods in Jobs; there might be other containers dependent on the network.
The good news is that in version 1.7, Istio introduced a global configuration property, values.global.proxy.holdApplicationUntilProxyStarts, that injects the sidecar container at the beginning of the container list of a Pod and causes other containers to wait until it starts. This is described in great detail in a blog post by Marko Lukša: Delaying application start until sidecar is ready.
Istio AUX contains a MutatingAdmissionWebhook that mutates the pods submitted to namespaces with specific labels and adds an Istio-specific annotation to the Pods:
proxy.istio.io/config: "holdApplicationUntilProxyStarts: true"
That way, the Istio Operator will take care of rearranging the sidecars and delaying the first non-Istio container start until the proxy is ready. This could also be solved by setting the same Istio Proxy property globally; however, it is false by default, and it’s not clear whether changing that setting could impact other existing deployments outside Kubeflow.
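For reference, here is roughly what the mutated Pod template looks like if the same annotation were added to the example TFJob manually (a sketch based on the manifest from above):

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            # tells Istio to start the proxy first and hold the other containers until it is ready
            proxy.istio.io/config: "holdApplicationUntilProxyStarts: true"
        spec:
          containers:
          - name: tensorflow
            image: datastrophic/tensorflow:2.6.0-mnist
            command: ['python', '-u', 'mnist.py']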
Another part of the Istio AUX Controller is the Controller itself, which is also scoped to namespaces with specific labels and subscribed to Pod update events. All container status changes trigger the reconciliation, and the controller keeps checking which containers are still running in the Pod. Once there’s only one left and it is the Istio Proxy, the Controller execs into the pod and runs curl -sf -XPOST http://127.0.0.1:15020/quitquitquit inside it. The Istio Proxy container image has curl pre-installed, so there’s no need for an additional binary or a sidecar to terminate the proxy.
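The manual equivalent of what the controller does would look roughly like this (using one of the worker pods from the demo below as an example):

kubectl exec mnist-worker-0 -c istio-proxy -- curl -sf -XPOST http://127.0.0.1:15020/quitquitquit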
The termination heuristic is pretty naive but it is easy to extend it to a more sophisticated version e.g. checking against a list of container names that have to exit prior to terminating the Proxy.
Istio AUX Controller is a reference implementation for the above approach and is available on GitHub at datastrophic/istio-aux.
Demo
Prerequisites
You should have a Kubernetes cluster available; kind will suffice, but ensure the Docker daemon has sufficient resources to accommodate cert-manager, Istio, and the Kubeflow Training Operator, and to run a two-pod TFJob (8 CPUs and 8GB of RAM should be sufficient). The following software is required: kubectl, kind, and istioctl.
Cluster setup
The cluster setup is pretty straightforward. The only highlight here is that we will use the composite Training Operator that supports all types of training jobs (formerly the TF Operator).
kind create cluster
# wait for node(s) to become ready
kubectl wait --for condition=Ready node --all
# install cert-manager
kubectl create -f https://github.com/jetstack/cert-manager/releases/download/v1.5.3/cert-manager.yaml
# wait for pods to become ready
kubectl wait --for=condition=Ready pods --all --namespace cert-manager
# install istio
istioctl install --set profile=demo -y
# install the training operator
kubectl apply -k "github.com/kubeflow/tf-operator.git/manifests/overlays/standalone?ref=master"
# wait for pods to become ready
kubectl wait --for=condition=Ready pods --all --namespace kubeflow
# install the Istio AUX controller
kubectl apply -k "github.com/datastrophic/istio-aux.git/config/default?ref=master"
Deploying the workloads
Let’s create a TFJob that will be used for testing, enable Istio injection for the default namespace, and submit the job:
kubectl label namespace default istio-injection=enabled
cat <<EOF >./tfjob.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
name: mnist
spec:
tfReplicaSpecs:
Worker:
replicas: 2
restartPolicy: OnFailure
template:
spec:
containers:
- name: tensorflow
image: datastrophic/tensorflow:2.6.0-mnist
command: ['python', '-u', 'mnist.py']
EOF
kubectl create -f tfjob.yaml
kubectl get pods -w
We’ll see that the Pods will eventually get stuck in the NotReady state with one container still running.
Now let’s enable the Istio AUX Controller for the default namespace and redeploy the TFJob one more time.
kubectl delete -f tfjob.yaml
kubectl label namespace default io.datastrophic/istio-aux=enabled
kubectl create -f tfjob.yaml
kubectl get pods -w
This time, all the pods reach the Completed state.
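The state of the job itself can be verified as well; once all the workers are done, the TFJob should report successful completion:

kubectl get tfjob mnist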
In the meantime, the Istio AUX Controller logs contain an output like this:
...
INFO webhook.webhook processing pod mnist-worker-0
INFO webhook.webhook pod mnist-worker-0 processed
...
INFO webhook.webhook processing pod mnist-worker-1
INFO webhook.webhook pod mnist-worker-1 processed
...
INFO istio-aux found a pod with istio proxy, checking container statuses {"pod": "mnist-worker-0"}
INFO istio-aux some containers are still running, skipping istio proxy shutdown {"pod": "mnist-worker-0", "containers": ["tensorflow"]}
...
INFO istio-aux found a pod with istio proxy, checking container statuses {"pod": "mnist-worker-1"}
INFO istio-aux some containers are still running, skipping istio proxy shutdown {"pod": "mnist-worker-1", "containers": ["tensorflow"]}
...
INFO istio-aux found a pod with istio proxy, checking container statuses {"pod": "mnist-worker-0"}
INFO istio-aux the payload containers are terminated, proceeding with the proxy shutdown {"pod": "mnist-worker-0"}
...
INFO istio-aux found a pod with istio proxy, checking container statuses {"pod": "mnist-worker-1"}
INFO istio-aux the payload containers are terminated, proceeding with the proxy shutdown {"pod": "mnist-worker-1"}
...
INFO istio-aux found a pod with istio proxy, checking container statuses {"pod": "mnist-worker-0"}
INFO istio-aux istio-proxy is already in a terminated state {"pod": "mnist-worker-0"}
...
INFO istio-aux found a pod with istio proxy, checking container statuses {"pod": "mnist-worker-1"}
INFO istio-aux istio-proxy is already in a terminated state {"pod": "mnist-worker-1"}
...
Final thoughts
The proposed solution works for the current versions of Kubernetes and Istio but, given the fast pace of their evolution, might become outdated relatively quickly. It would be nice to have similar functionality built into either of them, but it is understandable that container interdependencies in Pods do not generalize well into a universal solution.
Ideally, it would be great to have this problem solved by Kubernetes itself. As described in Sidecar container lifecycle changes in Kubernetes 1.18, it was proposed to assign containers a lifecycle type so that sidecars would be terminated by the Kubelet once the payload containers complete.
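As a rough illustration of that proposal (this API has not shipped in this form, so treat it purely as a sketch of the idea), a sidecar container would be marked with a dedicated lifecycle type and handled accordingly by the Kubelet:

containers:
- name: istio-proxy
  # proposed (never released) field from the sidecar lifecycle KEP
  lifecycle:
    type: Sidecar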
Although the reference implementation addresses a specific use case and a subset of the Kubeflow Operators, it provides a relatively generic solution to the problem but, of course, requires additional work to productionize it.
Please don’t hesitate to reach out to me with feedback and/or if you are interested in collaboration.