Fine-Tuning an Open-Weight Large Language Model for Kubernetes Troubleshooting on Apple Silicon
What I set out to build
I wanted a small, opinionated model that behaves like a senior site reliability engineer (SRE) when you hand it a Kubernetes symptom. Not a general chat model that knows a bit of everything, but a specialist that, given kubectl output or a described failure, reliably does four things in order:
- States the most likely root cause.
- Names the exact commands to confirm it.
- Gives a concrete, copy-pasteable fix.
- Ends with a one-line prevention tip.
That shape, cause → confirm → fix → prevent, is the whole product. A stock instruct model already knows plenty of Kubernetes; what it lacks is the discipline to answer in tight SRE muscle memory instead of a hedging wall of generic advice. Teaching behavior, not facts, turns out to be exactly what a small fine-tune is good at [1].
The entire pipeline runs on a single laptop, a MacBook Pro M5 Max with 128 GB of unified memory, using Apple's MLX framework [2] for training and llama.cpp [3] for serving. And the training data isn't scraped or hand-written by me: it's synthesized by Claude Opus 4.8 acting as a teacher. So this is also a worked example of the classic knowledge-distillation recipe [4]: use a strong closed model to bootstrap a specialized open-weight one.
A few terms up front, in case this isn't your daily vocabulary. A large language model is a network of billions of numeric weights, built on the transformer architecture [5] (the 14B in this model's name means about 14 billion of them), that together predict the next token, a word or fragment of a word produced by a byte-pair-encoding tokenizer [6]. Fine-tuning is taking an already-trained model and training it a little more on a small, focused set of examples so it leans toward a specific behavior, the pre-train-then-fine-tune paradigm popularized by models like BERT [7]. The base model is the off-the-shelf starting point; open-weight means its weights are published for anyone to download and adapt, unlike a closed model such as Claude, whose weights stay private behind an application programming interface (API). Inference is just running the finished model to generate answers, and serving means keeping it running as an always-on service other tools can send requests to. Finally, knowledge distillation [4] is the trick this project leans on: have a strong closed model (the teacher) write training data for a smaller open one (the student), transferring the behavior without ever touching the teacher's weights.
Claude Opus → synthetic K8s Q&A (JSONL)
→ train/valid/test split (mlx-lm chat format)
→ LoRA fine-tune Qwen2.5-Coder-14B (mlx_lm.lora)
→ evaluate (loss + held-out prompts)
→ fuse adapter → convert to GGUF → quantize
→ serve with llama.cpp
Why these choices
Base model: Qwen2.5-Coder-14B-Instruct. Kubernetes troubleshooting is mostly YAML Ain't Markup Language (YAML), shell, and code, exactly a coding model's strength [8]. At 14B it's the sweet spot on a 128 GB box: fast iteration, room for long contexts, and good enough reasoning. The same pipeline scales to a 32B or 70B base later by changing one line of config.
Low-rank adaptation (LoRA), not full fine-tuning. Full fine-tuning rewrites all of the model's billions of weights, which is slow and memory-hungry. LoRA instead leaves the original weights frozen and trains a small set of extra matrices [9], the adapter, that nudge the model's internal math (its attention and feed-forward layers). The adapter is only tens of megabytes, trains in minutes, is trivial to version, and is cheap to throw away when an experiment doesn't pan out. When I'm happy with one, I fuse it, meaning I merge those nudges back into the base weights to get a single standalone model for deployment.
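In symbols, the adapter and the fuse step look roughly like this, following the LoRA paper [9]. The symbols $W_0$, $A$, $B$, $r$, $s$, $d$, and $k$ are notation introduced here for illustration. The paper writes the scale as $\alpha / r$; how a given mlx-lm version maps its scale setting onto $s$ is an assumption worth checking against that version.

```latex
% Forward pass through one adapted layer: frozen W_0 plus a rank-r update.
h = W_0 x + s \, B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)

% Fusing merges the update into a single standalone matrix.
W_{\mathrm{fused}} = W_0 + s \, B A

% Trainable parameters per adapted matrix: r (d + k) instead of d k.
% Illustrative sizes only: d = k = 4096, r = 8 gives 65{,}536 versus 16{,}777{,}216.
```

Only $A$ and $B$ receive gradients, which is why the adapter stays small and why throwing one away costs nothing.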
MLX on Apple Silicon. mlx-lm [10] ships mlx_lm.lora (train), mlx_lm.fuse (merge), and mlx_lm.generate (inference) out of the box, and it's built for unified-memory Apple Silicon. The Metal graphics processing unit (GPU) reads weights in place from the same memory pool the central processing unit (CPU) uses, with near-zero copy overhead.
Serving with llama.cpp. The M5 Max already serves other models through llama.cpp, so the final stages convert the fused model into GGUF [11] (the file format llama.cpp loads) and quantize it: shrink each weight from 16 bits down to about 5, which cuts the model's size and memory use by roughly two-thirds in exchange for a small, usually unnoticeable, drop in quality [12]. Then it's served exactly like every other model on the box.
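As a back-of-the-envelope check on the "roughly two-thirds" claim, here is a minimal sketch that scales file size by bits per weight. It uses the 28 GB f16 file and the 5.69 bits-per-weight figure reported later in this article; the function name is made up for illustration, and the estimate ignores file metadata and the fact that not every tensor is quantized to the same width.

```python
def quantized_size_gb(f16_size_gb: float, bits_per_weight: float,
                      source_bits: float = 16.0) -> float:
    """Estimate a quantized file size by scaling with average bits per weight."""
    return f16_size_gb * bits_per_weight / source_bits


F16_SIZE_GB = 28.0   # f16 GGUF size reported in the article
Q5_K_M_BPW = 5.69    # average bits per weight reported for Q5_K_M

estimate = quantized_size_gb(F16_SIZE_GB, Q5_K_M_BPW)
reduction = 1 - Q5_K_M_BPW / 16.0

print(f"estimated size: {estimate:.1f} GB, reduction: {reduction:.0%}")
```

The estimate comes out near 10 GB with a reduction of about 64%, which is in line with the 9.8 GB file the quantizer actually produced.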
Remote development. The work was driven from a second Mac running Claude Code, orchestrating the M5 Max over SSH on a Tailscale network; the M5 Max did the heavy compute.
┌──────────────────┐                  ┌──────────────────────────┐
│  Driver Mac      │  SSH/Tailscale   │  M5 Max (128 GB)         │
│  (Claude Code)   │  ──────────►     │  - MLX env (uv, py3.12)  │
│  - repo + docs   │  rsync data/cfg  │  - mlx_lm.lora training  │
│  - data gen      │  ──────────►     │  - fuse + GGUF convert   │
│  - orchestration │                  │  - llama.cpp serving     │
└──────────────────┘                  └──────────────────────────┘
The dataset is the product
For this project I treated the dataset as the product, not an afterthought. The dataset is the lever; everything downstream just preserves whatever behavior the data encodes.
Every example is generated under one system prompt, shared by the teacher (Opus while authoring) and the student (at train and inference time), so the learned behavior matches the deployment prompt exactly:
You are a senior Kubernetes SRE assistant. For each problem:
1. State the most likely root cause(s), briefly.
2. Give the exact kubectl/diagnostic commands to confirm it.
3. Provide a concrete fix (commands or YAML, copy-pasteable).
4. End with a one-line prevention tip.
Be terse and technical. No filler. Assume kubectl is configured.
To stop the model from overfitting to "everything is CrashLoopBackOff," the examples are deliberately spread across the real distribution of Kubernetes failures [13]: CrashLoopBackOff, ImagePullBackOff, OOMKilled, Pending scheduling, PersistentVolumeClaim (PVC) pending, services with no endpoints, DNS (Domain Name System) resolution, ingress routing, role-based access control (RBAC) forbidden, probe failures, config/secret issues, node NotReady, evictions, init-container loops, ResourceQuota, stuck rollouts, HorizontalPodAutoscaler (HPA) not scaling, NetworkPolicy default-deny, and more. Each category gets multiple distinct scenarios phrased differently (a raw kubectl paste, a described symptom, a log snippet) so the student generalizes across input styles.
The output is the mlx-lm chat format, one JavaScript Object Notation (JSON) object per line:
{"messages": [
{"role": "system", "content": "<the SRE system prompt>"},
{"role": "user", "content": "<symptom / kubectl output / question>"},
{"role": "assistant", "content": "<cause → confirm → fix → prevent>"}
]}
A build_dataset.py script merges the per-category files, then enforces a quality gate: valid JSON per line; non-empty system/user/assistant turns; every assistant answer must name at least one concrete kubectl command (vague "check your logs" answers are rejected); and dedupe on the normalized user prompt. It then splits deterministically:
wrote 47 -> data/train.jsonl
wrote 9 -> data/valid.jsonl
wrote 9 -> data/test.jsonl
Total valid records: 65 (from 13 category files)
Sixty-five examples is intentionally tiny: enough to prove the pipeline end to end and produce a visibly more SRE-shaped model, and cheap to iterate on. The 65 records come from 13 Opus-authored category files of uneven size, ranging from 4 to 8 examples each (the busier failure modes like CrashLoopBackOff get more). A generate_data.py script can scale this to thousands via the Anthropic API using the identical prompt and taxonomy, flowing through the exact same validation. More on why you'd want that below.
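The quality gate and deterministic split described above can be sketched in a few lines of standard-library Python. This is not the article's build_dataset.py: the function names, the kubectl regular expression, the normalization rule, and the seeded shuffle are all assumptions chosen to illustrate the checks, and the 9/9 hold-out sizes simply mirror the counts reported above.

```python
import json
import random
import re

REQUIRED_ROLES = ("system", "user", "assistant")


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so near-identical prompts dedupe."""
    return re.sub(r"\s+", " ", prompt.strip().lower())


def passes_gate(line: str):
    """Return (record, user_prompt) if the line passes every check, else None."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not all(isinstance(m, dict) for m in messages):
        return None
    by_role = {m.get("role"): m.get("content") for m in messages}
    # Every required turn must be present and non-empty.
    for role in REQUIRED_ROLES:
        if not isinstance(by_role.get(role), str) or not by_role[role].strip():
            return None
    # The answer must name at least one concrete kubectl command.
    if not re.search(r"\bkubectl\s+\w+", by_role["assistant"]):
        return None
    return record, by_role["user"]


def build(lines, n_valid=9, n_test=9, seed=0):
    """Filter, dedupe on the normalized user prompt, then split reproducibly."""
    seen, kept = set(), []
    for line in lines:
        result = passes_gate(line)
        if result is None:
            continue
        record, prompt = result
        key = normalize(prompt)
        if key in seen:
            continue
        seen.add(key)
        kept.append(record)
    random.Random(seed).shuffle(kept)  # fixed seed: same split on every run
    valid = kept[:n_valid]
    test = kept[n_valid:n_valid + n_test]
    train = kept[n_valid + n_test:]
    return train, valid, test
```

Rejecting on the first failed check keeps the gate easy to extend: a new rule is one more early return, and nothing downstream has to know about it.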
Training the LoRA
I trained LoRA with the mlx_lm.lora trainer [10] on the full-precision bf16 base rather than quantized LoRA (QLoRA) [14] on a 4-bit build. The reason is the deployment target: training on bf16 means mlx_lm.fuse produces clean bf16 weights that convert to GGUF without stacking two different quantization schemes on top of each other. On 128 GB, a 14B bf16 base plus a LoRA adapter is comfortable. Peak memory never crossed ~31 GB.
The interesting part wasn't the config; it was watching the model overfit in real time, because the dataset is so small.
First, the vocabulary. Training loss measures how well the model predicts the answers in the examples it is actively learning from. Validation loss measures the same thing on a held-out set the model never trains on. Loss is essentially the model's average surprise at the correct next token, so lower is better, and validation loss is the honest signal: the model cannot memorize its way to a good score on data it has never seen.
Two more terms the tables use. One iteration (the Iter column below) is a single training step on one batch of examples; with a batch size of 1, that is one example. One epoch is a full pass over the entire training set. So with 47 training examples, every 47 iterations is one epoch, and Run 1's 400 iterations work out to roughly 8 epochs over the same 47 rows. The more epochs you run on a small set, the more chances the model has to memorize it instead of learning the general pattern.
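The iteration-to-epoch arithmetic is simple enough to write down. A small sketch, with a hypothetical helper name, using the 47 training examples and batch size of 1 from the text:

```python
def epochs(iterations: int, train_examples: int, batch_size: int = 1) -> float:
    """Number of full passes over the training set after `iterations` steps."""
    return iterations * batch_size / train_examples


TRAIN_EXAMPLES = 47

for iters in (47, 400):
    print(f"{iters} iterations = {epochs(iters, TRAIN_EXAMPLES):.1f} epochs")
```

With these numbers, 47 iterations is exactly one epoch and 400 iterations is about 8.5.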
A good fit looks like both numbers falling together and then validation loss flattening out near its minimum: the model has learned the pattern and keeps generalizing to unseen examples. Overfitting is the opposite. Training loss keeps dropping, because the model is memorizing the exact training rows, while validation loss stalls and then turns upward, because it is getting worse at everything outside that handful of examples. The widening gap between the two curves is the tell.
There is no universal "good" number for loss; its scale depends on the data and the tokenizer (the part that chops text into tokens), so what you watch is the shape of the curves, not an absolute value. The ideal stopping point is simply the iteration where validation loss is lowest, and you save that snapshot of the weights, called a checkpoint, rather than the final one. This is the classic early-stopping rule [15]. With the tables below, read down the validation-loss column and look for where it bottoms out.
The runs below also list a few training knobs. rank and num_layers set how much capacity the adapter has: how many extra parameters it adds, and how many of the model's layers it touches. scale (also called alpha) sets how strongly the trained adapter is applied on top of the frozen weights. lr, the learning rate, is how big an adjustment each training step makes. dropout randomly ignores part of the adapter during training to discourage memorization [16]. iters is the total number of training steps. On a dataset this small, more capacity and more steps mostly buy you faster overfitting.
Run 1 (num_layers=16, rank=16, lr=1e-4, iters=400, roughly 8 epochs over 47 rows) overfit hard. Training loss collapsed toward 0.03 (near-perfect memorization of the training set) while validation loss bottomed early, around iteration 50, and then climbed back up, a textbook overfitting curve:
| Iter | Val loss | Train loss |
|---|---|---|
| 1 | 2.554 | n/a |
| 50 | 1.301 | ~0.2 |
| 100 | 1.443 | ~0.1 |
| 150 | 1.464 | 0.115 |
| 200 | 1.740 | 0.055 |
| 250 | 1.822 | 0.039 |
| 300 | 1.782 | 0.026 |
Two problems: it was memorizing 47 examples, and with save_every=100 I never even checkpointed the validation minimum around iter 50.
Run 2 dialed capacity and learning rate down (num_layers=8, rank=8, scale=16, lr=5e-5, dropout=0.1, iters=150) and checkpointed every 25 iterations. Lower minimum, gentler curve:
| Iter | Val loss |
|---|---|
| 1 | 2.554 |
| 25 | 1.129 |
| 50 | 1.075 ← best |
| 75 | 1.152 |
| 100 | 1.228 |
| 125 | 1.416 |
| 150 | 1.471 |
Training was fast: ~2.4 it/s, ~780 tok/s, peak memory ~30.9 GB of 128. I deployed the iter-50 checkpoint (lowest validation loss), not the final iter-150 weights.
The durable takeaway: with a tiny dataset the model picks up the style within about one epoch [1] (validation loss bottomed out around iteration 50 of a 47-example training set in both runs), and everything after that is memorization. Early checkpoints plus frequent saving beat training longer. The real fix for the ceiling is more data, not more steps.
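The checkpoint-selection rule can be made concrete with the validation losses from the two tables above. This is a minimal sketch: the helper name is hypothetical, and it assumes checkpoints are written exactly at multiples of save_every and that validation loss was measured at those iterations.

```python
# Validation loss by iteration, copied from the Run 1 and Run 2 tables.
RUN1_VAL = {1: 2.554, 50: 1.301, 100: 1.443, 150: 1.464,
            200: 1.740, 250: 1.822, 300: 1.782}
RUN2_VAL = {1: 2.554, 25: 1.129, 50: 1.075, 75: 1.152,
            100: 1.228, 125: 1.416, 150: 1.471}


def best_saved_checkpoint(val_loss: dict, save_every: int):
    """Among iterations that were actually checkpointed, pick the lowest val loss."""
    saved = {it: loss for it, loss in val_loss.items() if it % save_every == 0}
    best_iter = min(saved, key=saved.get)
    return best_iter, saved[best_iter]


print(best_saved_checkpoint(RUN1_VAL, save_every=100))  # (100, 1.443)
print(best_saved_checkpoint(RUN2_VAL, save_every=25))   # (50, 1.075)
```

Run 1's best recoverable checkpoint is iteration 100 at 1.443, because the minimum near iteration 50 was never written to disk. Run 2's denser saving captures the true minimum.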
Evaluation: don't trust one loss number
I checked the result two ways: a quantitative loss on the held-out test split, and a qualitative side-by-side of base vs fine-tuned on unseen prompts. For a behavioral fine-tune, the qualitative diff is the signal that actually matters: automated language-model metrics like loss and perplexity correlate poorly with how well a model actually follows instructions, which is why human and side-by-side comparative evaluation are the standard for this kind of work [17], [18].
The quantitative path produced a genuine gotcha. On the held-out split, both the base and the adapter reported the same Test loss 0.945, Test ppl 2.572 (perplexity, ppl, is just the loss run through e^x, a second way of writing the same error). In the version I used (mlx-lm 0.31.3), mlx_lm.lora --test --adapter-path evaluated the base weights in both runs. The adapter wasn't being applied on the eval path. So that number is a base reference, not a base-vs-tuned delta. The lesson is blunt: verify the adapter actually changes outputs rather than trusting a single scalar.
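Two quick checks follow from this paragraph, sketched below. The first confirms that the reported perplexity is just the exponential of the reported loss. The second is a differential test: run the same prompts through both models and flag any where the text is identical. The function names are hypothetical, the two generate callables stand in for whatever wraps your base and adapted models, and the comparison only means something if decoding is deterministic (for example, greedy).

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean per-token loss."""
    return math.exp(loss)


def prompts_with_changed_output(base_generate, tuned_generate, prompts):
    """Return the prompts on which the two models produce different text."""
    return [p for p in prompts if base_generate(p) != tuned_generate(p)]


# Reported test loss 0.945 should map to roughly the reported ppl 2.572.
print(f"ppl = {perplexity(0.945):.3f}")

# Stand-in callables for illustration only; replace with real model calls.
base = lambda prompt: "1. **Root Cause(s):** ..."
tuned = lambda prompt: "Root cause: ..."
changed = prompts_with_changed_output(base, tuned, ["rollout stuck", "PVC pending"])
print(f"{len(changed)} of 2 prompts changed")
```

If the list of changed prompts comes back empty, the adapter is not being applied, no matter what the loss number says.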
And it does change outputs. Here's a real held-out prompt, "A Deployment rollout is stuck. kubectl rollout status says 2 of 5 updated replicas available; new pods are Running but never Ready. What now?", run through both models with the same SRE system prompt:
############ BASE (Qwen2.5-Coder-14B-Instruct) ############
1. **Root Cause(s):**
- Pod readiness probes failing.
- Resource constraints (CPU/Memory).
- Misconfigured environment variables.
2. **Diagnostic Commands:**
kubectl describe deployment <deployment-name>
kubectl get pods -l app=<app-label> -o wide
kubectl logs <pod-name>
kubectl get events --sort-by=.metadata.creationTimestamp
3. **Concrete Fix:**
- Check readiness probe configuration and adjust if necessary.
- Scale down the deployment temporarily to free up resources.
- Verify and update environment variables if misconfigured.
4. **Prevention Tip:**
- Regularly monitor pod events and logs for anomalies.
########## FINE-TUNED (base + iter-50 LoRA) ##########
Root cause: the new pods are Running but never Ready. The readiness probe is failing.
Confirm:
kubectl get pods -l app=<app> -o wide
kubectl describe pod <pod> | grep -iE 'Readiness|Unhealthy'
Fix: correct the readiness probe (or the app) so it passes. For example:
readinessProbe:
  httpGet: { path: /healthz, port: 80 }
  periodSeconds: 5
kubectl apply -f <dep>.yaml
Prevention: validate readiness probe paths and app readiness in staging before prod.
Look at what the fine-tune produced. It commits to the single most likely cause instead of hedging across three. It drops the bold-markdown scaffolding for the terse Root cause: / Confirm: / Fix: / Prevention: shape from the training data. It uses the signature diagnostic idiom it was taught, kubectl describe pod <pod> | grep -iE 'Readiness|Unhealthy', rather than a generic kubectl logs. And it's shorter and more actionable: 143 generated tokens versus 174 for the base. Generation ran at ~16 tok/s for the 14B bf16 model plus adapter, peaking around 29.8 GB.
Deploying through llama.cpp
An MLX adapter can't be loaded by llama.cpp directly, so deployment is a three-step round trip: fuse the adapter into the base, convert to GGUF, then quantize.
# 1. fuse adapter into a standalone HF-format model
mlx_lm.fuse \
--model Qwen/Qwen2.5-Coder-14B-Instruct \
--adapter-path adapters \
--save-path ~/models/k8s-coder-14b/fused-hf
# 2. convert to GGUF (f16)
python ~/llama.cpp/convert_hf_to_gguf.py ~/models/k8s-coder-14b/fused-hf \
--outfile ~/models/k8s-coder-14b/k8s-coder-14b-f16.gguf --outtype f16
# 3. quantize (Q5_K_M is a good quality/size balance for a 14B)
~/llama.cpp/build/bin/llama-quantize \
~/models/k8s-coder-14b/k8s-coder-14b-f16.gguf \
~/models/k8s-coder-14b/k8s-coder-14b-Q5_K_M.gguf Q5_K_M
Two gotchas bit me here, and they'll bite you too:
- mlx_lm.fuse --export-gguf fails for Qwen2 with ValueError: Model type qwen2 not supported for GGUF conversion. MLX's built-in exporter only covers a handful of architectures, so you fuse to Hugging Face (HF) format and convert with llama.cpp's convert_hf_to_gguf.py instead.
- convert_hf_to_gguf.py needs torch and gguf, which a fresh uv venv lacks (and there's no pip): uv pip install --python .venv/bin/python gguf torch.
The artifacts at the end of the round trip:
| File | Size | Notes |
|---|---|---|
| fused-hf/ | ~27 GB | bf16 HF model (6 safetensors shards) from mlx_lm.fuse |
| k8s-coder-14b-f16.gguf | 28 GB | 579 tensors, from convert_hf_to_gguf.py |
| k8s-coder-14b-Q5_K_M.gguf | 9.8 GB | 5.69 bits per weight; quantize took ~33 s |
Served with llama-server on 127.0.0.1:8080, it loads in under a second from cache. I chose Q5_K_M over the near-lossless but larger Q8_0 (~16 GB) and the smaller, slightly more degraded Q4_K_M (~9 GB): SRE answers have to be command-correct, and Q5 keeps that precision while still leaving plenty of memory for context. The key question is whether the fine-tuned behavior survives quantization. It does. Here's a live query through the OpenAI-compatible API, same SRE system prompt:
User: "A PVC is stuck Pending and there is no default StorageClass. How do I fix it?"
Root cause: no default StorageClass exists. The PVC can't find a provisioner.
Confirm:
kubectl get sc
kubectl get pvc <pvc> -o jsonpath='{.status.conditions[?(@.type=="Provisioning")].message}{"\n"}'
Fix: create a default StorageClass (example for AWS EBS): …(YAML)…
kubectl patch storageclass standard -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
Prevention: always create a default StorageClass at cluster setup.
One caveat on the model's answer: the jsonpath filter on a Provisioning condition is non-standard. A PVC's state lives in .status.phase (here, Pending), and provisioning detail surfaces through kubectl describe pvc events rather than a condition of that name. The diagnosis, fix, and prevention are all correct; only that one probe command would need adjusting before you paste it. It is a good reminder that synthetic-data idioms still need a human spot-check.
That's the trained behavior intact after a full fuse → GGUF → quantize → serve trip: the cause → confirm → fix → prevent shape and the exact is-default-class patch idiom from the training data, now running as a 9.8 GB quantized model right next to every other model on the box.
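For reference, a query like the one above can be sent with nothing but the Python standard library. This is a sketch under assumptions: it targets the /v1/chat/completions route that llama-server exposes for OpenAI-style clients, the helper names and the temperature and max_tokens values are illustrative choices, and the address matches the 127.0.0.1:8080 used in this article.

```python
import json
import urllib.request

# The shared SRE system prompt from the dataset section.
SYSTEM_PROMPT = (
    "You are a senior Kubernetes SRE assistant. For each problem:\n"
    "1. State the most likely root cause(s), briefly.\n"
    "2. Give the exact kubectl/diagnostic commands to confirm it.\n"
    "3. Provide a concrete fix (commands or YAML, copy-pasteable).\n"
    "4. End with a one-line prevention tip.\n"
    "Be terse and technical. No filler. Assume kubectl is configured."
)


def build_request(question: str, base_url: str = "http://127.0.0.1:8080",
                  max_tokens: int = 512) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request."""
    payload = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        "temperature": 0,          # assumption: deterministic answers for ops use
        "max_tokens": max_tokens,  # assumption: generous cap for a terse answer
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, timeout: float = 120.0) -> str:
    """Send the question to the local server and return the assistant's text."""
    with urllib.request.urlopen(build_request(question), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, with the server running:
# print(ask("A PVC is stuck Pending and there is no default StorageClass. How do I fix it?"))
```

Sending the same system prompt at inference time matters: the adapter was trained with it on every example, so omitting it moves the model off the distribution it learned.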
The whole loop, on one laptop
Claude Opus authored the data → mlx_lm.lora trained a rank-8 adapter on Qwen2.5-Coder-14B on the M5 Max → fused → converted to GGUF → quantized to Q5_K_M → served by llama.cpp, where it answers Kubernetes problems in a consistent SRE shape. Every stage ran locally.
| Stage | Result |
|---|---|
| Dataset | 65 Opus-authored examples, 13 failure categories → 47 train / 9 valid / 9 test |
| Base model | Qwen2.5-Coder-14B-Instruct (bf16, ~29 GB download) |
| LoRA | rank 8, scale 16, dropout 0.1, top 8 layers, lr 5e-5 |
| Train speed | ~2.4 it/s, ~780 tok/s, peak mem ~31 GB / 128 GB |
| Best checkpoint | iter 50, val loss 1.075 (down from 2.55) |
| Fused → GGUF | 28 GB f16 → 9.8 GB Q5_K_M (quantize ~33 s) |
| Serving | llama.cpp llama-server, OpenAI-compatible API on :8080 |
What I'd do next
The honest limitation is the data. Forty-seven training examples buys strong style transfer and little new knowledge [1], and synthetic data inherits the teacher's blind spots [19] (that critique lands hardest on broad, general-purpose imitation; the narrow, curated distillation here is more defensible, though the teacher's knowledge gaps still propagate). So the obvious next moves all point at the dataset:
- Scale to 500–2000 examples with the same prompt and taxonomy, so a couple of real epochs are possible without immediate overfitting and the long-tail failure modes get covered.
- Hold out by category for evaluation, and add a large language model (LLM) as a judge [18]: have Opus score each answer against a rubric, for example whether the root cause is correctly identified, whether the fix includes runnable commands, and whether the prevention tip is non-trivial.
- Add real, sanitized cluster transcripts to ground the synthetic style in reality.
- Try a 32B base for harder reasoning; the pipeline is unchanged, only model: and the timings move.
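Of the items above, holding out by category is the easiest to prototype. A minimal sketch, assuming each record carries a hypothetical "category" field (the chat-format records shown earlier do not include one, so it would have to be added or derived from the source file name):

```python
import random


def split_by_category(records, n_holdout: int, seed: int = 0):
    """Hold out whole categories so the test set contains only unseen failure types."""
    categories = sorted({r["category"] for r in records})
    random.Random(seed).shuffle(categories)  # fixed seed: reproducible choice
    held_out = set(categories[:n_holdout])
    train = [r for r in records if r["category"] not in held_out]
    test = [r for r in records if r["category"] in held_out]
    return train, test, held_out


# Toy data for illustration: 13 categories with 5 examples each.
records = [{"category": f"cat-{c}", "id": i} for c in range(13) for i in range(5)]
train, test, held_out = split_by_category(records, n_holdout=2)
print(len(train), len(test), sorted(held_out))
```

A random row-level split lets the model see every failure type during training. Holding out entire categories instead tests whether the cause → confirm → fix → prevent shape transfers to a failure mode it was never shown.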
The broader point is that the whole thing, from generate and train to evaluate, quantize, and serve, fits comfortably on one Apple Silicon laptop, and the most valuable engineering wasn't the training run. It was curating the dataset and watching validation loss closely enough to catch overfitting before it shipped.
References
[1] C. Zhou, P. Liu, P. Xu, et al., "LIMA: Less Is More for Alignment," arXiv:2305.11206, 2023.
[2] Apple Machine Learning Research, "MLX: An array framework for Apple silicon," github.com/ml-explore/mlx, accessed June 2026.
[3] G. Gerganov and contributors, "llama.cpp: LLM inference in C/C++," github.com/ggml-org/llama.cpp, accessed June 2026.
[4] G. Hinton, O. Vinyals, and J. Dean, "Distilling the Knowledge in a Neural Network," arXiv:1503.02531, 2015.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention Is All You Need," arXiv:1706.03762, 2017.
[6] OpenAI, "tiktoken: a fast byte-pair-encoding tokeniser," github.com/openai/tiktoken, accessed June 2026. Qwen2.5's tokenizer is a byte-pair-encoding tokenizer in this family; the foundational subword method is R. Sennrich, B. Haddow, and A. Birch, "Neural Machine Translation of Rare Words with Subword Units," arXiv:1508.07909, 2016.
[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv:1810.04805, 2018.
[8] B. Hui, J. Yang, Z. Cui, J. Yang, et al., "Qwen2.5-Coder Technical Report," arXiv:2409.12186, 2024. Model card: huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct.
[9] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-Rank Adaptation of Large Language Models," arXiv:2106.09685, 2021.
[10] Apple, "MLX LM: Fine-tuning and generation utilities for LLMs in MLX," github.com/ml-explore/mlx-lm, accessed June 2026.
[11] ggml-org, "GGUF file format specification," github.com/ggml-org/ggml/blob/master/docs/gguf.md, accessed June 2026.
[12] T. Dettmers and L. Zettlemoyer, "The Case for 4-bit Precision: k-bit Inference Scaling Laws," arXiv:2212.09720, 2022.
[13] The Kubernetes Authors, "Troubleshooting Applications," kubernetes.io/docs/tasks/debug/debug-application, accessed June 2026.
[14] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRA: Efficient Finetuning of Quantized LLMs," arXiv:2305.14314, 2023.
[15] L. Prechelt, "Early Stopping, But When?" in Neural Networks: Tricks of the Trade, Springer, 1998. doi.org/10.1007/3-540-49430-8_3.
[16] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning Research, 2014. jmlr.org/papers/v15/srivastava14a.html.
[17] L. Ouyang, J. Wu, X. Jiang, et al., "Training Language Models to Follow Instructions with Human Feedback," arXiv:2203.02155, 2022.
[18] L. Zheng, W.-L. Chiang, Y. Sheng, et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," arXiv:2306.05685, 2023.
[19] A. Gudibande, E. Wallace, C. Snell, et al., "The False Promise of Imitating Proprietary LLMs," arXiv:2305.15717, 2023.