Scaling a Synthetic Fine-Tuning Dataset 30x With No API Budget

A neon cyberpunk infographic titled LLM Fine-Tuning: Turn Data Into Capability, showing a six-stage funnel from data collection through deployment, with the slogan "better models start with better data."

Where part one left off

In the first article I fine-tuned Qwen2.5-Coder-14B-Instruct into a Kubernetes troubleshooting assistant: given a symptom, it answers in a tight SRE shape (cause, confirm, fix, prevent). The whole pipeline ran on one MacBook Pro M5 Max, the training data was written by Claude Opus acting as a teacher ^[1], and the adapter was trained with low-rank adaptation (LoRA) ^[2] using Apple's MLX ^[3], then served through llama.cpp ^[4].

That run worked, but it had one honest limitation: 47 training examples. The model learned the response style in well under an epoch and then spent the rest of training memorizing those 47 rows. Training loss collapsed toward zero while validation loss bottomed early and climbed. The article ended with the obvious prescription: scale the data to a couple of thousand examples and retrain.

This is that follow-up. Two things turned out to be interesting: the validation loss genuinely improved with scale, and getting to 2,000 examples without an API budget required a workaround I did not expect to be writing about.

New to the terms here (LoRA, validation loss, epoch, quantization)? The first article defines them in plain language. This post assumes them.

The headline: data size versus validation loss

I ran the same recipe three times, each with a larger dataset, and recorded the best validation loss (the model's error on held-out examples it never trained on, where lower is better):

Run	Train examples	Categories	Best val loss
v1	47	13	1.075
v2	635	22	0.864
v3	1,439	33	0.821

The "train examples" column above is the training split; the full corpus grew from 65 to 2,053 examples over the same runs (a 47-row split came out of v1's 65-example corpus, and 1,439 out of v3's 2,053). Either way the increase is roughly 30x, and it was the single biggest quality lever, which is exactly what scaling-law work would predict: with the model and compute fixed, more and more-diverse data lowers the achievable loss ^[5], ^[6].

The more telling signal is how the loss curves changed shape. In v1, training loss collapsed to about 0.03, the fingerprint of a model memorizing 47 rows. In v3, training loss settled near 0.8 instead, which is what genuine learning looks like: the model is fitting a pattern it cannot simply memorize. The v3 descent also lasted longer before overfitting:

Iter	v3 val loss
50	1.006
200	0.910
350	0.870
500	0.821 ← best
600	0.859
700+	rising (overfit)

Every run still overfits after roughly 0.7 to 1 epoch, so the recipe is unchanged from part one: checkpoint often and keep the validation-loss minimum, which is the classic early-stopping rule ^[7]. What scale buys you is a lower minimum and broader coverage, not freedom from overfitting.

The problem: no API budget

The plan in part one assumed I would generate the extra examples with a script (generate_data.py) calling the Anthropic API, reusing the same system prompt and taxonomy. This is the standard synthetic-data recipe: have a capable model author instruction-and-response pairs, the approach behind Self-Instruct ^[8] and the instruction-tuning datasets that followed it.

Then I hit a wall that does not show up in papers: the work was running on a Claude Code Max subscription, which has no pay-per-token API key. There was no token endpoint for a script to call. I still had a very capable model available, just not in the shape generate_data.py expected.

The workaround: Claude Code subagents as a data generator

Claude Code can spawn subagents ^[9], each a separate agent instance with its own context and task. So instead of one script making thousands of API calls, I expanded the taxonomy to about 120 seed scenarios (drawn from public Kubernetes troubleshooting documentation) and, from a single orchestrating Claude Code session, used its subagent (Task) tool to spawn roughly 20 subagents, each handed a disjoint slice of the taxonomy plus the shared SRE system prompt and told to author 90 to 120 examples into its own file.

Each seed is not one example. Following the same varied-phrasing idea from part one, every agent turns a seed into several distinct rows (a raw kubectl paste, a described symptom, a log snippet), which is how ~120 seeds fan out to ~2,000 examples without becoming repetitive.

I split the work by model to spend the subscription's capacity where it mattered: Opus for the subtle categories (container exit codes, networking, DNS, RBAC and SELinux) and Sonnet for the more mechanical ones. In a distillation setup the labels are the teaching signal, so accuracy on the hard categories is worth the extra cost, while the easy ones do not need it.

This is still teacher-to-student distillation; it just uses an interactive agent harness as the generation engine instead of a billed API. The same caveat from part one applies: an imitation dataset inherits the teacher's blind spots, and that critique lands hardest on broad, general-purpose imitation ^[10]. A narrow, curated, heavily quality-gated domain like this one is more defensible, but the teacher's gaps still propagate, so the data still needs spot-checking.

What actually went wrong (and how to avoid it)

The subagent approach worked, but only after several real failures worth passing on:

Make subagents write a script, not raw JSON. The first agents tried to hand-type JSONL directly into a file and hit the model's output-token cap mid-array, producing truncated, invalid files. The reliable pattern is to have each agent write a small Python builder: author the user and assistant text as Python strings, then json.dumps each line. That guarantees valid escaping and keeps the agent's visible output tiny, well under the cap.
Give each agent a disjoint file. Twenty agents writing gen01.jsonl through gen20.jsonl never collide. A central build_dataset.py then merges, deduplicates on the normalized user prompt, and applies the quality gate (every assistant answer must name at least one concrete kubectl command).
Subscription limits are real, so make the work resumable. The first wave of 18 agents hit the Max plan's 5-hour session limit mid-run: 9 finished, 9 failed with a session-reset message. Because training is local MLX compute and does not touch the subscription, I trained an interim model (v2) on whatever had finished: the 9 completed subagent categories plus part one's 13 hand-authored ones, which is the 22 categories in the table above. So v2 was an opportunistic checkpoint, not a deliberately chosen dataset size. I then resumed the failed 9 after the reset and added a top-up wave to reach v3's 33 categories.
Commit after every batch. Each wave was validated and pushed to git immediately, so a generation run that cost real session time was never more than one batch away from being saved.

The final corpus was 2,053 validated examples across 33 category files (20 from subagents, 13 hand-authored), split 1,439 / 307 / 307 for train, validation, and test.

Retraining: more data lets you turn the knobs back up

Part one's lesson on 47 examples was to shrink the adapter (rank 8, 8 layers, low learning rate) because anything larger memorized instantly. With 1,439 examples that constraint goes away. A well-curated dataset of this size is past the threshold where a modest amount of high-quality data reliably teaches a behavior ^[11], so v3 restored the larger configuration:

num_layers: 16        # LoRA on the top 16 of 48 transformer blocks
batch_size: 2
iters: 1100
learning_rate: 1.0e-4
lora_parameters:
  rank: 16
  scale: 16.0
  dropout: 0.05

That trained at about 0.66 iterations/second (batch 2) with peak memory around 32 GB of the 128 GB available, and reached its best validation loss of 0.821 at iteration 500. With 1,439 examples at batch size 2, one epoch is roughly 720 iterations, so iteration 500 is about 0.7 epoch. I let it run the full 1,100 iterations to see the whole overfitting curve (validation loss turns upward after ~700), then deployed the iter-500 checkpoint, the same minimum-checkpoint decision as part one.

The deployable model is the same path as before: fuse the iter-500 adapter into the base, convert to GGUF, quantize to Q5_K_M (9.8 GB), and serve it through llama.cpp's OpenAI-compatible API. The one new production touch is making it durable: a small launchd agent (a property-list file in ~/Library/LaunchAgents with KeepAlive set) runs llama-server on login and restarts it if it dies, so the fine-tuned model is always up as a local service. One gotcha worth knowing: with KeepAlive enabled, a plain kill will not stop the server because launchd respawns it; use launchctl bootout to actually take it down.

Does it actually feel better?

Lower validation loss is necessary but not sufficient, so the real check is still a qualitative one. v3 keeps the terse cause-confirm-fix-prevent shape from v1, but the broader 33-category coverage shows up where v1 was thin: scheduling and quota failures, NetworkPolicy default-deny, init-container dependency loops, and the security categories now get specific, correct diagnostic commands instead of the generic checklist a smaller model falls back to. The model is also less likely to pattern-match every prompt onto CrashLoopBackOff, which was a visible failure mode when the training set was dominated by a few categories.

Honest limits and what is next

Held-out testing is still loss-based. I have a 307-example test split but have not yet scored it with a rubric. The right next step is an LLM-as-judge pass (is the root cause correct, are the commands valid, is the fix runnable) rather than leaning on a single loss number.
Synthetic data is still synthetic. None of these 2,053 examples came from a real cluster. Folding in sanitized real kubectl transcripts would ground the style in reality and surface gaps the teacher does not know it has.
A bigger base is one config line away. The pipeline is unchanged for a 32B base; only the model name and the timings move, and the M5 Max has the memory headroom.

The broader takeaway is simple: better models start with better data. Part one proved the machinery worked. Part two confirmed that the machinery was never the bottleneck. The dataset was, and the most useful engineering was not in the training loop at all. It was in generating, deduplicating, quality-gating, and scaling the data, even when the obvious tool (a billed API) was not available.

References

[1] G. Hinton, O. Vinyals, and J. Dean, "Distilling the Knowledge in a Neural Network," arXiv:1503.02531, 2015.

[2] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-Rank Adaptation of Large Language Models," arXiv:2106.09685, 2021.

[3] Apple Machine Learning Research, "MLX: An array framework for Apple silicon," github.com/ml-explore/mlx, accessed June 2026.

[4] G. Gerganov and contributors, "llama.cpp: LLM inference in C/C++," github.com/ggml-org/llama.cpp, accessed June 2026.

[5] J. Kaplan, S. McCandlish, T. Henighan, et al., "Scaling Laws for Neural Language Models," arXiv:2001.08361, 2020.

[6] J. Hoffmann, S. Borgeaud, A. Mensch, et al., "Training Compute-Optimal Large Language Models," arXiv:2203.15556, 2022.

[7] L. Prechelt, "Early Stopping, But When?" in Neural Networks: Tricks of the Trade, Springer, 1998. doi.org/10.1007/3-540-49430-8_3.

[8] Y. Wang, Y. Kordi, S. Mishra, et al., "Self-Instruct: Aligning Language Models with Self-Generated Instructions," arXiv:2212.10560, 2022.

[9] Anthropic, "Claude Code: Subagents," docs.claude.com/en/docs/claude-code/sub-agents, accessed June 2026.

[10] A. Gudibande, E. Wallace, C. Snell, et al., "The False Promise of Imitating Proprietary LLMs," arXiv:2305.15717, 2023.

[11] C. Zhou, P. Liu, P. Xu, et al., "LIMA: Less Is More for Alignment," arXiv:2305.11206, 2023.