MIDS Capstone Thesis

KubeBench

KubeBench is a Kubernetes-native benchmark and serving platform for code-generating LLMs. The project combines domain fine-tuning, runtime validation against real clusters, and cloud deployment infrastructure to evaluate and serve practical DevOps assistants.

KubeBench architecture overview

Problem

General-purpose code models often perform poorly on Kubernetes workflows, and string-based metrics fail to capture whether generated YAML actually works in real clusters.

Approach

Built a domain-specific benchmark that evaluates generated manifests through operational checks against live Kubernetes APIs, then combined those checks with quality scoring to compare models.
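The layered-validation idea behind those operational checks can be sketched as follows. This is an illustrative outline, not the benchmark's actual evaluators: the helper names are hypothetical, and the real scoring logic is not reproduced here. Cheap structural checks run first; a server-side dry run then asks a real cluster's API server to validate the manifest.

```python
import shutil
import subprocess


def structural_checks(manifest: dict) -> list[str]:
    """Cheap pre-flight checks before touching a cluster.

    Returns a list of failure messages; an empty list means the
    manifest passed every structural check.
    """
    failures = []
    for field in ("apiVersion", "kind", "metadata"):
        if field not in manifest:
            failures.append(f"missing required field: {field}")
    name = manifest.get("metadata", {}).get("name", "")
    if not name:
        failures.append("metadata.name is empty")
    return failures


def server_side_dry_run(manifest_yaml: str) -> bool:
    """Operational check: ask the API server to admit the manifest
    without persisting it (requires kubectl and cluster access)."""
    if shutil.which("kubectl") is None:
        raise RuntimeError("kubectl not found on PATH")
    result = subprocess.run(
        ["kubectl", "apply", "--dry-run=server", "-f", "-"],
        input=manifest_yaml, capture_output=True, text=True,
    )
    return result.returncode == 0
```

The ordering matters: structural failures are caught without consuming cluster resources, while the dry run exercises admission, schema, and RBAC paths that no string metric can see.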

Outcome

Delivered an end-to-end pipeline covering data preparation, fine-tuning, runtime evaluation, and deployment, plus a web interface for interacting with the specialized Kubernetes models.

Thesis Scope

Fine-Tuning and Models

  • Built and versioned Kubernetes-focused datasets with DVC.
  • Trained multiple QLoRA model candidates (Qwen and Gemma families).
  • Tracked candidate artifacts and model iterations for selection.
  • Focused on code-generation tasks for Kubernetes and related infrastructure workflows.
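To make the QLoRA setup concrete, here is a sketch of what one candidate's configuration looks like, written as plain dicts whose keys mirror the bitsandbytes quantization and peft `LoraConfig` parameters. The values are illustrative assumptions, not the thesis's actual hyperparameters, and `trainable_fraction` is a hypothetical helper for back-of-the-envelope sizing.

```python
# Illustrative QLoRA settings; keys mirror bitsandbytes / peft
# configuration, values are assumptions for the sketch.
QUANT_CONFIG = {
    "load_in_4bit": True,              # quantize the frozen base model
    "bnb_4bit_quant_type": "nf4",      # NormalFloat4
    "bnb_4bit_compute_dtype": "bfloat16",
}

LORA_CONFIG = {
    "r": 16,                 # adapter rank
    "lora_alpha": 32,        # scaling factor
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}


def trainable_fraction(base_params: int, hidden: int,
                       n_layers: int, r: int) -> float:
    """Rough estimate of the fraction of weights LoRA actually trains:
    each targeted square projection gains two low-rank matrices of
    shape (hidden x r), per layer."""
    adapter_params = (
        n_layers * len(LORA_CONFIG["target_modules"]) * 2 * hidden * r
    )
    return adapter_params / base_params
```

The point of the estimate is why QLoRA makes multi-candidate training tractable: for a ~7B-parameter base model with these assumed settings, the adapters amount to well under 1% of the weights, so many candidates can be trained and versioned cheaply.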

Benchmark and Evaluation

  • Generated and evaluated task suites for cluster-centric scenarios.
  • Executed runtime validation against Minikube and GKE environments.
  • Scored outputs with operational evaluators and judge-based rubrics.
  • Prioritized executable correctness over string similarity.
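That prioritization can be expressed as a gating rule: executable correctness is a hard gate, operational checks carry most of the weight, and the judge rubric refines the remainder. The function below is a hypothetical illustration of that structure, not the thesis's actual scoring formula, and the 0.7/0.3 weights are assumptions.

```python
def benchmark_score(deploys: bool, checks_passed: int,
                    checks_total: int, rubric_score: float) -> float:
    """Combine operational evaluators with a judge-based rubric
    (illustrative weighting).

    A manifest that fails to deploy scores zero no matter how
    plausible its text looks; otherwise the operational pass rate
    dominates and the rubric (0.0-1.0) refines the score.
    """
    if not deploys:
        return 0.0
    operational = checks_passed / checks_total if checks_total else 1.0
    return 0.7 * operational + 0.3 * rubric_score
```

The hard gate is what separates this from string-similarity metrics: a manifest can be one token away from a reference answer and still be operationally worthless, and vice versa.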

Serving and Platform Engineering

The serving stack uses a two-tier architecture: a FastAPI proxy layer on DigitalOcean App Platform for request orchestration and validation, and a GPU-backed FastAPI model server on GCP for inference.
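The division of labor between the two tiers can be sketched in a few lines of standard-library Python. This is a simplified stand-in for the FastAPI endpoints, with a placeholder backend URL and an assumed prompt-length limit; the real services are not reproduced here. The proxy tier rejects malformed requests before they reach (and bill for) the GPU tier.

```python
import json
import urllib.request
from typing import Optional

# Placeholder address for the GPU-backed model server (tier 2).
MODEL_SERVER = "http://gpu-backend.example.internal/v1/generate"


def validate_request(payload: dict) -> Optional[str]:
    """Tier-1 responsibility: reject bad requests before they consume
    GPU time. Returns an error message, or None if the request is valid."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        return "prompt must be a non-empty string"
    if len(prompt) > 8192:  # illustrative limit
        return "prompt too long"
    return None


def proxy(payload: dict, timeout: float = 30.0) -> dict:
    """Tier-1 -> tier-2 hop: forward a validated request for inference."""
    error = validate_request(payload)
    if error:
        return {"error": error}
    req = urllib.request.Request(
        MODEL_SERVER,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Keeping validation on the cheap App Platform tier means the GPU server only ever sees well-formed traffic, and the two tiers can be scaled and deployed independently.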

Infrastructure and deployment are automated through Terraform modules and GitHub Actions, enabling repeatable provisioning and model service rollout.
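As a rough illustration of the module pattern, a root configuration might invoke a reusable GPU-service module like the fragment below. The module path, variable names, and machine settings are hypothetical, not the project's actual Terraform.

```hcl
# Illustrative module call; names and values are assumptions.
module "model_server" {
  source       = "./modules/gcp-gpu-service"
  machine_type = "n1-standard-8"
  gpu_type     = "nvidia-tesla-t4"
  image        = var.model_server_image # image tag injected by CI
}
```

Parameterizing the container image lets the GitHub Actions workflow roll out a new model service by re-applying the same module with a new tag, which is what makes the provisioning repeatable.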

Artifacts