MIDS Capstone Thesis

KubeBench

KubeBench is a Kubernetes-native benchmark and serving platform for code-generating LLMs. The project combines domain fine-tuning, runtime validation against real clusters, and cloud deployment infrastructure to evaluate and serve practical DevOps assistants.

KubeBench architecture overview

Problem

General-purpose code models often perform poorly on Kubernetes workflows, and string-based metrics fail to capture whether generated YAML actually works in real clusters.

Approach

Built a domain-specific benchmark that evaluates generated manifests through operational checks against live Kubernetes APIs, then combined those checks with quality scoring to compare models.
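The layered-validation idea behind those operational checks can be sketched as follows. This is an illustrative outline, not the benchmark's actual evaluators: the helper names are hypothetical, and the real scoring logic is not reproduced here. Cheap structural checks run first; a server-side dry run then asks a real cluster's API server to validate the manifest.

```python
import shutil
import subprocess


def structural_checks(manifest: dict) -> list[str]:
    """Cheap pre-flight checks before touching a cluster.

    Returns a list of failure messages; an empty list means the
    manifest passed every structural check.
    """
    failures = []
    for field in ("apiVersion", "kind", "metadata"):
        if field not in manifest:
            failures.append(f"missing required field: {field}")
    name = manifest.get("metadata", {}).get("name", "")
    if not name:
        failures.append("metadata.name is empty")
    return failures


def server_side_dry_run(manifest_yaml: str) -> bool:
    """Operational check: ask the API server to admit the manifest
    without persisting it (requires kubectl and cluster access)."""
    if shutil.which("kubectl") is None:
        raise RuntimeError("kubectl not found on PATH")
    result = subprocess.run(
        ["kubectl", "apply", "--dry-run=server", "-f", "-"],
        input=manifest_yaml, capture_output=True, text=True,
    )
    return result.returncode == 0
```

The ordering matters: structural failures are caught without consuming cluster resources, while the dry run exercises admission, schema, and RBAC paths that no string metric can see.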

Outcome

Delivered an end-to-end pipeline covering data preparation, fine-tuning, runtime evaluation, and deployment, plus a web interface for interacting with the specialized Kubernetes models.

Thesis Scope

Fine-Tuning and Models

  • Built and versioned Kubernetes-focused datasets with DVC.
  • Trained multiple QLoRA model candidates (Qwen and Gemma families).
  • Tracked candidate artifacts and model iterations for selection.
  • Focused on code-generation tasks for Kubernetes and related infrastructure workflows.
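To make the QLoRA setup concrete, here is a sketch of what one candidate's configuration looks like, written as plain dicts whose keys mirror the bitsandbytes quantization and peft `LoraConfig` parameters. The values are illustrative assumptions, not the thesis's actual hyperparameters, and `trainable_fraction` is a hypothetical helper for back-of-the-envelope sizing.

```python
# Illustrative QLoRA settings; keys mirror bitsandbytes / peft
# configuration, values are assumptions for the sketch.
QUANT_CONFIG = {
    "load_in_4bit": True,              # quantize the frozen base model
    "bnb_4bit_quant_type": "nf4",      # NormalFloat4
    "bnb_4bit_compute_dtype": "bfloat16",
}

LORA_CONFIG = {
    "r": 16,                 # adapter rank
    "lora_alpha": 32,        # scaling factor
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}


def trainable_fraction(base_params: int, hidden: int,
                       n_layers: int, r: int) -> float:
    """Rough estimate of the fraction of weights LoRA actually trains:
    each targeted square projection gains two low-rank matrices of
    shape (hidden x r), per layer."""
    adapter_params = (
        n_layers * len(LORA_CONFIG["target_modules"]) * 2 * hidden * r
    )
    return adapter_params / base_params
```

The point of the estimate is why QLoRA makes multi-candidate training tractable: for a ~7B-parameter base model with these assumed settings, the adapters amount to well under 1% of the weights, so many candidates can be trained and versioned cheaply.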

Benchmark and Evaluation

  • Generated and evaluated task suites for cluster-centric scenarios.
  • Executed runtime validation against Minikube and GKE environments.
  • Scored outputs with operational evaluators and judge-based rubrics.
  • Prioritized executable correctness over string similarity.
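That prioritization can be expressed as a gating rule: executable correctness is a hard gate, operational checks carry most of the weight, and the judge rubric refines the remainder. The function below is a hypothetical illustration of that structure, not the thesis's actual scoring formula, and the 0.7/0.3 weights are assumptions.

```python
def benchmark_score(deploys: bool, checks_passed: int,
                    checks_total: int, rubric_score: float) -> float:
    """Combine operational evaluators with a judge-based rubric
    (illustrative weighting).

    A manifest that fails to deploy scores zero no matter how
    plausible its text looks; otherwise the operational pass rate
    dominates and the rubric (0.0-1.0) refines the score.
    """
    if not deploys:
        return 0.0
    operational = checks_passed / checks_total if checks_total else 1.0
    return 0.7 * operational + 0.3 * rubric_score
```

The hard gate is what separates this from string-similarity metrics: a manifest can be one token away from a reference answer and still be operationally worthless, and vice versa.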

Serving and Platform Engineering

The serving stack uses a two-tier architecture: a FastAPI proxy layer on DigitalOcean App Platform for request orchestration and validation, and a GPU-backed FastAPI model server on GCP for inference.
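The division of labor between the two tiers can be sketched in a few lines of standard-library Python. This is a simplified stand-in for the FastAPI endpoints, with a placeholder backend URL and an assumed prompt-length limit; the real services are not reproduced here. The proxy tier rejects malformed requests before they reach (and bill for) the GPU tier.

```python
import json
import urllib.request
from typing import Optional

# Placeholder address for the GPU-backed model server (tier 2).
MODEL_SERVER = "http://gpu-backend.example.internal/v1/generate"


def validate_request(payload: dict) -> Optional[str]:
    """Tier-1 responsibility: reject bad requests before they consume
    GPU time. Returns an error message, or None if the request is valid."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        return "prompt must be a non-empty string"
    if len(prompt) > 8192:  # illustrative limit
        return "prompt too long"
    return None


def proxy(payload: dict, timeout: float = 30.0) -> dict:
    """Tier-1 -> tier-2 hop: forward a validated request for inference."""
    error = validate_request(payload)
    if error:
        return {"error": error}
    req = urllib.request.Request(
        MODEL_SERVER,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Keeping validation on the cheap App Platform tier means the GPU server only ever sees well-formed traffic, and the two tiers can be scaled and deployed independently.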

Infrastructure and deployment are automated through Terraform modules and GitHub Actions, enabling repeatable provisioning and model service rollout.
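As a rough illustration of the module pattern, a root configuration might invoke a reusable GPU-service module like the fragment below. The module path, variable names, and machine settings are hypothetical, not the project's actual Terraform.

```hcl
# Illustrative module call; names and values are assumptions.
module "model_server" {
  source       = "./modules/gcp-gpu-service"
  machine_type = "n1-standard-8"
  gpu_type     = "nvidia-tesla-t4"
  image        = var.model_server_image # image tag injected by CI
}
```

Parameterizing the container image lets the GitHub Actions workflow roll out a new model service by re-applying the same module with a new tag, which is what makes the provisioning repeatable.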

Artifacts