
Understanding --cache-ram in llama.cpp: Prompt Caching, Eviction Errors, and Apple Silicon

Tags: LLM, llama.cpp, Apple Silicon, Infrastructure, KV Cache

This post documents a debugging session that started with mysterious KV cache eviction errors on my llama-server and ended with a clearer understanding of what --cache-ram actually does, how it interacts with KV cache quantization on Apple Silicon, and why most online guides get it wrong.

The Problem

I run a llama-server on an M5 Max (128 GB unified memory) serving Qwen 3.6 35B-A3B with 128k context, 2 parallel slots, q8_0 KV cache quantization, and Flash Attention. The setup is documented in my previous benchmarking post, with the launchd plist configured for persistent inference on a local network.

The server kept throwing the same error during long-context agentic sessions:

failed to find free space in the KV cache, retrying with smaller batch size

Requests would fail, slots would report busy, and conversations would get cut short after 15 to 20 tool calls. The math seemed straightforward: the model is approximately 25 GB, each KV cache slot is approximately 4 GB at q8_0 128k context, total is approximately 33 GB. With 128 GB available, I should have 95 GB of headroom. Where was the memory going?
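The back-of-the-envelope budget above can be written out as a quick check. This is a minimal sketch that uses only the approximate figures quoted in this post; the variable names are illustrative and nothing here is measured.

```python
# Rough memory budget from the approximate figures quoted above (all in GB).
TOTAL_UNIFIED_GB = 128
MODEL_WEIGHTS_GB = 25
KV_PER_SLOT_GB = 4      # q8_0 KV cache at 128k context, per slot
PARALLEL_SLOTS = 2

kv_total_gb = KV_PER_SLOT_GB * PARALLEL_SLOTS
expected_use_gb = MODEL_WEIGHTS_GB + kv_total_gb
headroom_gb = TOTAL_UNIFIED_GB - expected_use_gb

print(f"KV cache total: {kv_total_gb} GB")
print(f"expected use:   {expected_use_gb} GB")
print(f"headroom:       {headroom_gb} GB")
```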

What Most Guides Say (And Get Wrong)

Many online guides describe --cache-ram as forcing the KV cache into system RAM instead of VRAM. This is incorrect.

I found articles on craftrigs.com, various Medium posts, and Reddit discussions all saying the same thing: set --cache-ram to a huge value to give yourself more KV cache space. The misconception is so widespread that I believed it myself initially. The reasoning seemed to follow: "Your GPU has limited VRAM. Use system RAM for the cache instead."

That explanation is wrong. --cache-ram does not control where the active KV cache lives. The active KV cache, the one used during token generation, always stays in GPU memory. On Apple Silicon, that is Metal memory. On NVIDIA hardware, that is VRAM. The parameter does not enable offloading. It does something completely different.

This misconception leads people to set --cache-ram to 32 GB or 64 GB, expecting more KV cache space. They get none. Worse, on memory-constrained systems even the default setting can create exactly the problem I was seeing: KV cache eviction errors on a machine that should have plenty of memory.

What --cache-ram Actually Does

Introduced in PR #16391 (October 2025), --cache-ram controls host-memory prompt caching. It is separate from the active KV cache entirely.

Here is how it works: when a request finishes and a slot goes idle, the server saves that slot's KV state to host RAM. When a new request arrives with a matching prefix (same system prompt, same conversation history prefix), the server restores the cached state instead of re-processing the prompt from scratch. This provides up to 93% time-to-first-token (TTFT) reduction for cached prefixes.
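To make the save-and-restore idea concrete, here is a toy model of a size-limited prefix cache. It is a sketch under loud assumptions: the class name, the oldest-first eviction policy, and the token-tuple keys are all hypothetical and are not llama.cpp's actual data structures or eviction logic.

```python
from collections import OrderedDict

def common_prefix_len(a, b):
    """Number of leading tokens that two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class ToyPromptCache:
    """Toy host-RAM prompt cache with a size limit in MiB (hypothetical)."""

    def __init__(self, limit_mib):
        self.limit_mib = limit_mib
        self.entries = OrderedDict()  # tuple(tokens) -> size in MiB, oldest first

    def save(self, tokens, size_mib):
        """Store an idle slot's state, evicting the oldest entries if over the limit."""
        if size_mib > self.limit_mib:
            return  # a state larger than the whole budget is never stored
        key = tuple(tokens)
        self.entries.pop(key, None)
        self.entries[key] = size_mib
        while sum(self.entries.values()) > self.limit_mib:
            self.entries.popitem(last=False)

    def best_match(self, prompt_tokens):
        """Length of the longest cached prefix shared with the new prompt."""
        return max(
            (common_prefix_len(k, prompt_tokens) for k in self.entries),
            default=0,
        )

cache = ToyPromptCache(limit_mib=8192)
system_prompt = [1, 2, 3, 4]
cache.save(system_prompt + [10, 11], size_mib=4096)

# A new request with the same system prompt can skip re-processing 4 tokens.
reused = cache.best_match(system_prompt + [20, 21, 22])
print(reused)  # 4
```

In this toy, a limit of 0 means nothing is ever stored, which mirrors the effect of --cache-ram 0.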

The default value is 8192 MiB, which equals 8 GiB. This has been enabled by default since October 2025.

The critical distinction: the active KV cache (the one used during inference) is separate and always lives in GPU/Metal memory. The prompt cache is a host-RAM feature for prefix reuse optimization.

For agentic workflows with consistent system prompts, this is extremely valuable. Every request shares the same system prompt prefix; if the prefix is cached in host RAM, prefill time drops from ~60 seconds (re-processing 128k tokens) to ~200 milliseconds (restoring from cache).

Why 8 GiB Caused Problems

On my M5 Max with 128 GB: model weights approximately 25 GB, 2 KV cache slots approximately 8 GB, system overhead approximately 13 GB, total approximately 46 GB used, approximately 82 GB free. Plenty of room, on paper.

But --cache-ram 8192 allocates an additional 8 GB of host RAM for prompt caching. On Apple Silicon, host RAM and GPU memory are the same unified pool. There is no separate system RAM and VRAM.

The prompt cache competes directly with the Metal allocator for unified memory. Under certain memory pressure conditions (other applications running, file cache growth, macOS WindowServer activity), this competition causes the Metal backend to fail KV cache allocations.

The eviction errors were not because the active KV cache was too large. They occurred because the prompt cache's 8 GB allocation reduced the available unified memory below the threshold the Metal allocator needed for KV cache operations. When the Metal backend tried to allocate KV cache space for an incoming request, the unified memory allocator could not find a sufficiently large free block, and the request failed with the eviction error.

This is unique to Apple Silicon's unified memory architecture. On NVIDIA hardware, host RAM and VRAM are physically separate. When --cache-ram consumes 8 GB of host RAM on an RTX 4090 with 24 GB VRAM, the GPU's VRAM allocations are completely unaffected. But on Apple Silicon, host RAM and GPU memory are the same resource, so the prompt cache's allocation directly starves the active KV cache.
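The difference between the two architectures comes down to which pool the prompt cache is charged against. A minimal sketch, using the 128 GB, 24 GB, and 8 GB figures from this post; the function name is hypothetical, and it assumes the worst case where the prompt cache has grown to its full limit.

```python
def gpu_pool_after_prompt_cache(gpu_pool_gb, prompt_cache_gb, unified):
    """Memory left for GPU allocations once the prompt cache is full.

    On a unified-memory system the prompt cache is carved out of the same
    pool the GPU uses; on a discrete GPU it lives in separate host RAM.
    """
    if unified:
        return gpu_pool_gb - prompt_cache_gb
    return gpu_pool_gb

# Apple Silicon: 128 GB unified pool, 8 GB prompt cache.
apple = gpu_pool_after_prompt_cache(128, 8, unified=True)
# RTX 4090: 24 GB VRAM, prompt cache sits in host RAM.
nvidia = gpu_pool_after_prompt_cache(24, 8, unified=False)

print(apple)   # 120
print(nvidia)  # 24
```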

The Fix

Three options, ordered by preference:

Option 1: Reduce prompt cache size. Set --cache-ram 2048 or --cache-ram 4096. You still get prompt caching benefits, with less memory pressure on the unified allocator. For agentic workflows with system prompts, even 2 to 4 GB of prompt cache is enough to cache 2 to 4 full conversation prefixes.

Option 2: Disable prompt caching entirely. Set --cache-ram 0. You lose the TTFT benefit from cached prefixes, but the memory competition goes away. On my server, I chose this path because the Qwen 3.6 model is already fast at prefill (1,137 t/s at 128k context), and memory stability was more important than the prompt caching optimization.

Option 3: Add --kv-unified. This flag is required for prompt caching to work properly anyway. Without it, the server disables idle slot caching and only prints a warning, which is easy to miss. If you want to keep --cache-ram enabled, always pair it with --kv-unified.

Here is the updated launchd plist with --cache-ram 0 and --kv-unified added:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.local.llama-server</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/llama-server</string>
        <string>-m</string>
        <string>/Models/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf</string>
        <string>-ngl</string>
        <string>99</string>
        <string>-fa</string>
        <string>1</string>
        <string>-c</string>
        <string>131072</string>
        <string>--cache-type-k</string>
        <string>q8_0</string>
        <string>--cache-type-v</string>
        <string>q8_0</string>
        <string>--cache-ram</string>
        <string>0</string>
        <string>--kv-unified</string>
        <string>--mlock</string>
        <string>--parallel</string>
        <string>2</string>
        <string>--jinja</string>
        <string>--host</string>
        <string>0.0.0.0</string>
        <string>--port</string>
        <string>8080</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/tmp/llama-server.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/llama-server.err</string>
</dict>
</plist>
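After editing the plist by hand, it is easy to leave a stale argument behind. The following sketch checks the ProgramArguments array for the two settings discussed above. The helper names are hypothetical, and the path passed to check_plist is whatever location you installed the plist to; plistlib is part of the Python standard library.

```python
import plistlib

def flag_value(args, flag):
    """Return the argument following `flag`, or None if the flag is absent."""
    if flag not in args:
        return None
    i = args.index(flag)
    return args[i + 1] if i + 1 < len(args) else None

def check_llama_args(args):
    """List the problems found in a llama-server argument list."""
    problems = []
    if flag_value(args, "--cache-ram") != "0":
        problems.append("--cache-ram is not set to 0")
    if "--kv-unified" not in args:
        problems.append("--kv-unified is missing")
    return problems

def check_plist(path):
    """Load a launchd plist and check its ProgramArguments."""
    with open(path, "rb") as fp:
        plist = plistlib.load(fp)
    return check_llama_args(plist["ProgramArguments"])

# Example with an abbreviated argument list:
print(check_llama_args(["llama-server", "--cache-ram", "0", "--kv-unified"]))  # []
```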

Interaction with Other Flags

Flag | Controls | Memory Pool
--cache-type-k q8_0 | Active KV cache key data type | GPU/Metal memory
--cache-type-v q8_0 | Active KV cache value data type | GPU/Metal memory
--cache-ram N | Prompt cache size limit | Host RAM (unified on Apple Silicon)
--kv-unified | Single shared KV buffer | GPU/Metal memory
-c 131072 | Context window size | Determines active KV cache size
--parallel N | Number of request slots | Multiplies active KV cache size

Understanding these flags separately prevents the common mistake of thinking --cache-ram affects the active KV cache size or placement.
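To see how the cache type and context size drive the active KV cache, here is a rough size estimate for a single sequence. It is a sketch with assumptions stated plainly: it uses the textbook formula for standard attention in every layer, and the layer and head counts below are hypothetical placeholders, not the real Qwen 3.6 35B-A3B architecture. The q8_0 figure reflects its block layout of 32 int8 values plus one fp16 scale (34 bytes per 32 elements).

```python
# Bytes per stored element for two KV cache types.
# q8_0: 32 int8 values + one fp16 scale per block = 34 bytes / 32 elements.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}

def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type_k, cache_type_v):
    """Estimate KV cache size for one sequence of n_ctx tokens."""
    elements_per_token = n_layers * n_kv_heads * head_dim
    k_bytes = elements_per_token * BYTES_PER_ELEMENT[cache_type_k]
    v_bytes = elements_per_token * BYTES_PER_ELEMENT[cache_type_v]
    return n_ctx * (k_bytes + v_bytes)

# Hypothetical architecture, for illustration only.
HYPOTHETICAL = dict(n_layers=32, n_kv_heads=8, head_dim=128)

f16 = kv_cache_bytes(131072, cache_type_k="f16", cache_type_v="f16", **HYPOTHETICAL)
q8 = kv_cache_bytes(131072, cache_type_k="q8_0", cache_type_v="q8_0", **HYPOTHETICAL)

print(f"f16:  {f16 / 2**30:.2f} GiB")   # 16.00 GiB
print(f"q8_0: {q8 / 2**30:.2f} GiB")    # 8.50 GiB
```

Note that --cache-ram does not appear anywhere in this calculation, which is the point of the table above.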

Known Issues

Several issues with --cache-ram and prompt caching have been reported on the llama.cpp repository:

Metal plus KV offload crash (issue #23578, May 2026): Setting --cache-ram with certain quantization types causes a GGML_ASSERT(buf_dst) failed crash during KV cache updates. Workaround: set --cache-ram 0 or add --no-kv-offload.

Slow cache updates (fixed in build b8185): Prompt cache updates could take 80 or more seconds, defeating the purpose of the optimization. This was a regression introduced in October 2025 and fixed by mid-May 2026.

--kv-unified plus --cache-reuse conflict (issue #23493): Using both flags together can cancel newly allocated tasks. If you need --cache-reuse for multi-turn caching, test thoroughly before deploying.

Memory-constrained devices: On phones, tablets, or 16 GB laptops, the default 8 GB --cache-ram allocation can starve the active KV cache. Always set --cache-ram 0 on memory-constrained systems unless you are running a very small model with short context.

Measuring Prompt Cache Effectiveness

If you enable --cache-ram with a non-zero value, look for these log lines in llama-server output:

prompt cache is enabled, size limit: 8192 MiB
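If you want to check this automatically, a small script can pull the size limit out of the log. This is a sketch that matches only the message text quoted above; any prefix the server puts before it, and which log file it lands in, are left as assumptions for you to adapt.

```python
import re

# Matches the log message quoted above and captures the limit in MiB.
PROMPT_CACHE_LINE = re.compile(r"prompt cache is enabled, size limit: (\d+) MiB")

def prompt_cache_limit_mib(log_text):
    """Return the limit from the last matching log line, or None if absent."""
    matches = PROMPT_CACHE_LINE.findall(log_text)
    return int(matches[-1]) if matches else None

sample = "prompt cache is enabled, size limit: 8192 MiB\n"
print(prompt_cache_limit_mib(sample))  # 8192
```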

Monitor time-to-first-token metrics: the first request to a slot will be slow (full prefill phase); subsequent requests with matching prefixes will be dramatically faster (cache hit).

For agentic workflows where every request shares the same system prompt, the cache is extremely effective. You will observe 200 to 500 millisecond TTFT for cached requests versus 30 to 120 seconds for uncached full prefill.
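As a sanity check on the upper end of that range, the uncached prefill time can be estimated from the prefill rate quoted earlier (1,137 t/s). This sketch assumes the rate stays constant over the whole prompt, which real workloads will not match exactly; the function name is illustrative.

```python
def prefill_seconds(n_tokens, tokens_per_second):
    """Estimated time to process a prompt at a constant prefill rate."""
    return n_tokens / tokens_per_second

full_prefill = prefill_seconds(131072, 1137)
print(f"uncached prefill of 131072 tokens: {full_prefill:.0f} s")

# Compare against the cached TTFT range quoted above (200 to 500 ms).
for cached_s in (0.2, 0.5):
    print(f"cached TTFT {cached_s} s -> roughly {full_prefill / cached_s:.0f}x faster")
```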

Summary: Lessons Learned

--cache-ram is prompt caching for prefix reuse, not KV cache placement. It does not move the active KV cache to system RAM; it caches idle KV states in host RAM for fast restoration.

On Apple Silicon, prompt cache competes with the active KV cache for the same unified memory pool. The default 8 GiB allocation is too aggressive for systems already running large models with long context windows and concurrent request slots.

Always pair --cache-ram with --kv-unified for correct behavior. Without the flag, the server disables idle slot caching and only logs a warning.

When debugging KV cache eviction errors, check total memory consumption including the prompt cache, not just model weights plus KV cache math. The prompt cache's host-RAM allocation reduces the available unified memory pool on Apple Silicon.

For most production deployments on Apple Silicon, set --cache-ram 0 unless your workload has extremely consistent prompts and TTFT optimization is critical. Memory stability is more valuable than the prompt caching speedup.