datastrophic

KEMU: A Declarative Approach to Emulating Kubernetes Clusters at Scale

4 November 2025·12 mins

Kubernetes Emulation Kind Kwok Kemu

Optimizing AI workload scheduling requires extensive experimentation and observation, but testing scheduler modifications in production is risky: configuration errors can cause multi-day delays and wasted capacity. This post introduces KEMU, a declarative Kubernetes Emulator Utility that replaces fragmented multi-tool cluster setups with a single configuration specification, enabling safe experimentation with large-scale GPU clusters on minimal resources.

Secure Kubeflow Ingress and Authentication with Istio External Auth, Dex, and OAuth2 Proxy

16 December 2021·16 mins

Kubernetes Istio Kubeflow

Publicly exposed, insecure service endpoints on Kubernetes pose a major risk of malicious workloads being deployed on your clusters. We’ve seen reports of the Kubernetes Dashboard, the Kubeflow Central Dashboard, and Kubeflow Pipelines all being compromised when publicly exposed to the Internet. Combined with broad RBAC permissions, publicly exposed software with workload scheduling capabilities opens your clusters to malicious deployments by anybody who knows the endpoint URL. This blog post focuses on building a secure ingress and authentication stack on Kubernetes with Istio, targeting Kubeflow installations.

The Ultimate Kubernetes Homelab Guide: From Zero to Production Cluster On-Premises

1 December 2021·14 mins

Kubernetes Devops

Whether you’re looking for a more powerful development environment or a production-grade Kubernetes cluster for experiments, this guide provides end-to-end deployment and configuration instructions to get the cluster up and running. The first part of this guide covers the planning and provisioning of the infrastructure with Proxmox and Terraform. The second part is dedicated to installing Kubernetes and essential software such as Calico for networking, OpenEBS for volume provisioning, and MetalLB for network load balancing.

Kubeflow Training Operators and Istio: solving the proxy sidecar lifecycle problem for AI/ML workloads

4 October 2021·12 mins

Kubernetes Kubeflow Istio Operators Mlops

With Kubeflow gaining traction in the community and seeing early adoption in enterprises, security and observability concerns are becoming increasingly important. Many organizations running AI/ML workloads operate on sensitive personal or financial data and have stricter requirements for data encryption, traceability, and access control. Quite often, the Istio service mesh is used to solve these problems and to gain other benefits of the rich functionality it provides.

Spark JobServer: from Spark Standalone to Mesos, Marathon and Docker

12 October 2017·9 mins

Spark Mesos Marathon Docker

After several years of running Spark JobServer workloads, the need for better availability and multi-tenancy emerged across several projects the author was involved in. This blog post covers the design decisions made to provide higher availability and fault tolerance of JobServer installations, multi-tenancy for Spark workloads, scalability and failure recovery automation, and the software choices made to reach these goals. Spark JobServer is widely used across a variety of reporting and aggregating systems.
