The LLM Field Manual: A Complete Guide to Language Model Terminology
I am not an expert in machine learning or natural language processing. This is a working reference I built for myself while experimenting with local LLMs on Apple Silicon. Treat it as field notes from a practitioner, not a textbook. If something is wrong or out of date, I would genuinely appreciate the correction.
What is an LLM?
A Large Language Model (LLM) is a neural network trained on massive amounts of text to predict the next token in a sequence. The "large" refers to the number of trainable parameters, typically ranging from a few billion to hundreds of billions. These parameters encode patterns in language: grammar, facts, reasoning strategies, coding ability.
LLMs are autoregressive text generators. Given a prompt, they produce one token at a time, feeding each generated token back into the model to produce the next. Everything else (chat interfaces, tool calling, code completion) is built on top of this mechanism.
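The generate-and-feed-back loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular engine implements it: `toy_model` is a hypothetical stand-in for a real network, and token IDs are plain integers.

```python
from typing import Callable, List

def generate(next_token: Callable[[List[int]], int],
             prompt: List[int],
             max_new_tokens: int,
             eos_id: int) -> List[int]:
    """Autoregressive decoding: predict one token, append it, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # the model sees everything produced so far
        tokens.append(tok)
        if tok == eos_id:         # stop once end-of-sequence is emitted
            break
    return tokens

def toy_model(tokens: List[int]) -> int:
    """Hypothetical model: always predicts 'last token + 1'."""
    return tokens[-1] + 1

result = generate(toy_model, prompt=[1, 2, 3], max_new_tokens=10, eos_id=6)
```

Here `result` ends as soon as the toy model emits the end-of-sequence ID, even though the token budget allowed more.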
The Transformer architecture
Nearly all modern LLMs use the Transformer architecture (2017). Know these components:
| Term | What it does |
|---|---|
| Attention | Lets the model weigh how much each token should influence the next prediction. In "self-attention", tokens attend to other tokens in the same sequence; decoder-only LLMs apply a causal mask so each token sees only itself and earlier tokens. |
| Multi-Head Attention | Runs multiple attention computations in parallel, each learning different aspects of the input (syntax, semantics, position). |
| Feed-Forward Network | A small multi-layer perceptron applied to each token independently after attention. Holds most of a dense model's parameters and is thought to store much of its learned knowledge. |
| Layer | One transformer block: attention + feed-forward. Models stack many layers (32, 64, 128+) for depth. |
| Embedding | The numerical representation of a token. Text becomes embeddings, flows through layers, then maps back to token probabilities. |
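To make the attention row concrete, here is a minimal single-head sketch in pure Python with a causal mask. It assumes the query, key, and value vectors are already computed; in a real model they come from learned projections of the embeddings, which are omitted here.

```python
import math
from typing import List

Matrix = List[List[float]]

def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Single-head attention where token i attends to tokens 0..i only."""
    dim = len(q[0])
    outputs = []
    for i in range(len(q)):
        # Scaled dot-product scores against the current and earlier tokens.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[j][c] for j, w in enumerate(weights))
                        for c in range(len(v[0]))])
    return outputs

x = [[1.0, 0.0], [0.0, 1.0]]
out = causal_self_attention(x, x, x)
```

The first token can only attend to itself, so its output equals its own value vector; the second token blends both.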
Attention variants
The attention variant determines KV cache size and inference speed. Know the differences:
| Variant | What it means for you |
|---|---|
| MHA (Multi-Head Attention) | The original. Each head gets its own Key, Value, and Query projections. Full quality, highest memory cost. Found in older models like Llama 1. |
| MQA (Multi-Query Attention) | All heads share one Key/Value projection, keep separate Queries. Shrinks the KV cache by a factor equal to the head count (32x for 32 heads). Slight quality tradeoff. |
| GQA (Grouped-Query Attention) | The current standard. Heads share Key/Value projections in groups (e.g., 32 query heads, 8 KV groups = 4x cache reduction, minimal quality loss). Used by Llama 3, Qwen, Gemma, and most modern models. |
Bottom line: with 32 query heads, GQA with 8 KV groups uses 4x less memory for its KV cache than MHA. This is one reason newer models handle much longer context windows without running out of memory.
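The cache savings follow directly from the number of KV heads. A rough calculator, assuming a hypothetical model with 32 layers, 32 query heads, a head dimension of 128, and an fp16 cache (none of these numbers describe a specific model):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: float = 2.0) -> float:
    """Approximate KV cache size: one K and one V vector per layer, KV head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

# Hypothetical shape: 32 layers, head_dim 128, 8k context, fp16 cache.
mha = kv_cache_bytes(32, n_kv_heads=32, head_dim=128, context_len=8192)
gqa = kv_cache_bytes(32, n_kv_heads=8, head_dim=128, context_len=8192)
mqa = kv_cache_bytes(32, n_kv_heads=1, head_dim=128, context_len=8192)
```

With these assumed numbers the MHA cache is 4 GiB, GQA with 8 KV heads is a quarter of that, and MQA is 1/32 of it.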
Positional encoding
Transformers have no built-in sense of token order. Positional encoding adds position information so the model distinguishes "the cat sat" from "sat the cat."
| Method | When you'll see it |
|---|---|
| Absolute positional encoding | Fixed position vectors per token. Capped at training length. Used by the original Transformer and GPT-2. Legacy. |
| RoPE (Rotary Position Embedding) | Rotates query/key vectors by position. The dominant method today (Llama, Qwen, Gemma). Supports extension beyond training length. |
| ALiBi (Attention with Linear Biases) | Adds a linear distance penalty. No learned parameters. Naturally extrapolates to longer sequences. |
| RoPE scaling | Extends context beyond training length. Linear scaling (simple, lossy), NTK-aware (better), YaRN (best, combines methods). |
When a model advertises "128k context" but trained on 8k, it uses RoPE scaling. Quality at extended lengths depends on the scaling method and whether the model was fine-tuned at the longer length.
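A small sketch of the rotation idea, assuming the interleaved-pair layout and the commonly used base of 10000 (real implementations differ in layout and details). The `scale` argument illustrates linear scaling only: positions are divided by a factor so a longer sequence maps into the trained range.

```python
import math
from typing import List

def rope(x: List[float], pos: float, base: float = 10000.0,
         scale: float = 1.0) -> List[float]:
    """Rotate consecutive pairs of x by an angle that depends on position."""
    dim = len(x)
    out = []
    for i in range(0, dim, 2):
        theta = (pos / scale) * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(a: List[float], b: List[float]) -> float:
    return sum(p * q for p, q in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
near = dot(rope(q, 3), rope(k, 1))    # positions 3 and 1
far = dot(rope(q, 10), rope(k, 8))    # positions 10 and 8, same distance
```

The two scores match because the rotated dot product depends only on the distance between positions, which is the property that makes RoPE a relative encoding.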
Flash Attention
Flash Attention computes attention in tiles that fit in fast on-chip SRAM instead of materializing the full attention matrix. Faster, less memory, same output.
| Term | What to know |
|---|---|
| Flash Attention | Tiled attention. Drops memory from O(n²) to O(n), runs 2 to 4x faster. Required for efficient long-context inference. |
| Flash Attention 2 | Better parallelism and work partitioning. Standard in most inference engines today. |
| Flash Attention 3 | Hopper (H100) GPU optimizations. Adds FP8 and async computation. |
| PagedAttention | Used by vLLM. Manages KV cache like virtual memory pages. Not related to Flash Attention despite frequent confusion. |
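The trick that lets tiled attention avoid the full score matrix is an "online" softmax: keep a running maximum, a running denominator, and a running weighted sum, and rescale them whenever a new tile raises the maximum. The sketch below shows this for a single query in pure Python. It illustrates the idea only, leaves out the usual 1/sqrt(d) scaling, and is not the actual GPU kernel.

```python
import math
from typing import List

def _dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def attention_full(q, keys, values):
    """Reference: materialize every score, then softmax and average."""
    scores = [_dot(q, k) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(len(values[0]))]

def attention_tiled(q, keys, values, tile=2):
    """Same result, but only one tile of scores exists at a time."""
    peak = float("-inf")            # running maximum score
    total = 0.0                     # running softmax denominator
    acc = [0.0] * len(values[0])    # running unnormalized output
    for start in range(0, len(keys), tile):
        scores = [_dot(q, k) for k in keys[start:start + tile]]
        new_peak = max(peak, max(scores))
        rescale = math.exp(peak - new_peak)  # 0.0 on the first tile
        total *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_peak)
            total += w
            acc = [a + w * vc for a, vc in zip(acc, v)]
        peak = new_peak
    return [a / total for a in acc]
```

Both functions return the same vector up to floating-point rounding, which is what "same output" means in practice.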
In practice: enable Flash Attention in llama.cpp with the -fa flag. Required for KV cache quantization. Recommended for any context above 4k tokens.
# Enable Flash Attention on a server
llama-server \
-m model.gguf \
-fa
Parameters and model sizing
Parameter count is the primary size metric. Parameters are the weights and biases learned during training.
| Notation | Meaning | Example |
|---|---|---|
| B | Billion parameters | Llama 3.1 70B = 70 billion parameters |
| M | Million parameters | Rare in modern LLMs |
| Active params | Parameters used per token (MoE only) | Qwen3 35B-A3B uses 3B of its 35B per token |
Quick math: each parameter in fp16 takes 2 bytes. A 70B model needs ~140 GB just for weights before quantization.
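The quick math generalizes to any bit width. A tiny helper, using decimal gigabytes and counting weights only (KV cache and runtime overhead come on top):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float = 16.0) -> float:
    """Memory for the weights alone, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9

fp16_70b = weight_memory_gb(70e9)        # 16 bits per weight
q4_70b = weight_memory_gb(70e9, 4.5)     # roughly 4.5 bits per weight
```

For a 70B model this gives 140 GB at fp16 and a little under 40 GB at about 4.5 bits per weight.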
Tokens and tokenization
Models process tokens (subword units), not raw text. A tokenizer handles the conversion.
| Term | What to know |
|---|---|
| Token | A subword unit. "unhappiness" might become ["un", "happiness"]. Common words stay whole; rare words split. |
| Vocabulary | The full token set. Typical sizes: 32k to 152k tokens. |
| BPE | Byte Pair Encoding. The dominant tokenization algorithm. Merges frequent character pairs iteratively. |
| Context length | Maximum tokens the model handles in one session (prompt + generation combined). Common: 8k, 32k, 128k, 1M. |
| Context window | Same as context length. |
Rule of thumb: 1 token ≈ 3/4 of an English word. 128k tokens ≈ 96,000 words ≈ 300 pages.
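To show what "merges frequent character pairs iteratively" means, here is a toy BPE trainer on a four-word corpus. It is a sketch of the merge loop only: production tokenizers work on bytes, apply pre-tokenization rules, and learn tens of thousands of merges. The corpus and its word counts are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Word = Tuple[str, ...]

def most_frequent_pair(words: Dict[Word, int]) -> Tuple[str, str]:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)  # ties go to the pair seen first

def merge_pair(words: Dict[Word, int], pair: Tuple[str, str]) -> Dict[Word, int]:
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(corpus: Dict[str, int], n_merges: int):
    words = {tuple(w): f for w, f in corpus.items()}  # start from characters
    merges: List[Tuple[str, str]] = []
    for _ in range(n_merges):
        pair = most_frequent_pair(words)
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

merges, words = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
```

After three merges the common suffix "est" has become a single symbol, which is why frequent word pieces end up as whole tokens.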
Dense vs. Mixture of Experts (MoE)
Two architectures to know:
Dense models activate every parameter for every token. A 70B dense model reads all 70 billion parameters per token. High quality, slow inference.
MoE models have many parameters but only activate a subset ("experts") per token. A router picks which experts fire.
| Term | What to know |
|---|---|
| Expert | One of several parallel feed-forward networks inside an MoE layer. Many exist, few activate per token. |
| Router | Decides which experts process each token. |
| Active parameters | Parameters used per token. The number that determines inference speed. |
| Total parameters | Full count including all experts. Determines download size and memory. |
| Top-k routing | Activating k most relevant experts per token. Common: top-2, top-4, top-8. |
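A toy version of top-k routing, to make the router and expert rows concrete. Experts here are hypothetical scalar functions, and the softmax is taken over the selected experts only; real models differ on that detail and typically add load-balancing terms during training.

```python
import math
from typing import Callable, Dict, List

def route_top_k(router_logits: List[float], k: int) -> Dict[int, float]:
    """Pick the k highest-scoring experts and normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x: float, experts: List[Callable[[float], float]],
                router_logits: List[float], k: int = 2) -> float:
    """Only the selected experts run; the rest are skipped entirely."""
    weights = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Hypothetical experts and router scores for one token.
experts = [lambda x: x, lambda x: 2 * x, lambda x: -x, lambda x: 4 * x]
logits = [0.1, 2.0, -1.0, 2.0]
y = moe_forward(3.0, experts, logits, k=2)
```

Experts 1 and 3 tie for the top score, so each gets half the weight and the other two never execute.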
Reading the name: XB-AYB = X billion total, Y billion active per token. Qwen3 35B-A3B has 35B total but activates 3B per token, giving 3B-class speed with much higher quality.
Why this matters on Apple Silicon: token generation is largely memory-bandwidth bound. MoE models read only their active parameters from memory for each token, so a 35B-A3B model approaches the generation speed of a 3B dense model, although all 35B parameters must still fit in memory. On Apple Silicon, with high memory bandwidth (400+ GB/s) and a large unified memory pool, this makes MoE a strong fit for local inference.
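The bandwidth argument can be turned into a back-of-the-envelope ceiling: each generated token requires reading the active weights once, so tokens per second cannot exceed bandwidth divided by bytes read. The sketch below ignores KV cache reads, compute time, and routing overhead, so real numbers land well below it. The 4.5 bits per weight figure is an assumption for illustration.

```python
def max_tokens_per_second(active_params: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on generation speed if weight reads are the only cost."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Assumed: ~4.5 bits per weight, 400 GB/s memory bandwidth.
moe_ceiling = max_tokens_per_second(3e9, 4.5, 400)     # 3B active parameters
dense_ceiling = max_tokens_per_second(35e9, 4.5, 400)  # 35B dense
```

The ratio between the two ceilings is simply 35/3, which is the whole case for MoE on bandwidth-limited hardware.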
Quantization
Quantization reduces weight precision from the original 16 bits per parameter to fewer bits. The result is a smaller model that needs less memory and generates faster, with quality loss that stays small down to roughly 4 bits and grows quickly below that.
Precision formats
| Format | Bits | Memory per 1B params | When to use |
|---|---|---|---|
| fp32 | 32 | 4 GB | Training only. Never for inference. |
| fp16 / bf16 | 16 | 2 GB | Baseline "unquantized" inference format. |
| Q8_0 | 8 | ~1 GB | Nearly lossless. Use when memory allows. |
| Q6_K | ~6.5 | ~0.8 GB | High quality, minimal loss. |
| Q5_K_M | ~5.5 | ~0.7 GB | Good quality/size balance. |
| Q4_K_M | ~4.5 | ~0.56 GB | The sweet spot for most users. Start here. |
| Q4_0 | 4 | ~0.5 GB | Basic 4-bit. Noticeable loss on small models. |
| Q3_K_M | ~3.5 | ~0.44 GB | Aggressive. Quality degrades. |
| Q2_K | ~2.5 | ~0.31 GB | Extreme. Significant quality loss. Last resort. |
| MXFP4 | 4 | ~0.5 GB | Microscaling FP4. Newer format with better numerics than Q4_0. |
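To show where fractional bit counts come from, here is a simplified block quantizer in the spirit of Q8_0: weights are grouped into blocks, each block stores one scale plus small integers. The block size of 32 and the 16-bit scale are assumptions about that format; the real llama.cpp code is more involved, and K-quants add further structure.

```python
from typing import List, Tuple

def quantize_block(weights: List[float]) -> Tuple[float, List[int]]:
    """Map a block of floats to int8-range integers plus one shared scale."""
    amax = max(abs(w) for w in weights)
    scale = amax / 127 if amax > 0 else 1.0
    return scale, [round(w / scale) for w in weights]

def dequantize_block(scale: float, q: List[int]) -> List[float]:
    """Recover approximate weights at inference time."""
    return [scale * v for v in q]

def storage_bits_per_weight(block_size: int = 32, value_bits: int = 8,
                            scale_bits: int = 16) -> float:
    """Effective bits per weight once the per-block scale is included."""
    return (block_size * value_bits + scale_bits) / block_size

block = [0.5, -1.27, 0.0, 0.3333, 1.0, -0.01]
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
```

Each restored weight is within half a quantization step of the original, and the per-block scale is why an "8-bit" format costs slightly more than 8 bits per weight.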
Quantization methods and formats
| Method | What to know |
|---|---|
| GGUF | The file format for llama.cpp. Single file with embedded metadata (tokenizer, config). The standard for local inference. Use this. |
| GPTQ | Post-training quantization with calibration data. Accurate 4-bit. Used with GPU servers (vLLM, TGI). |
| AWQ | Activation-Aware Weight Quantization. Faster to apply than GPTQ, often slightly better. |
| EXL2 | Variable bit-rate quantization for ExLlamaV2. Allocates more bits to important layers. |
| GGML | Predecessor to GGUF. Deprecated. Convert any GGML files you find. |
| UD (Unsloth Dynamic) | Unsloth's importance-weighted quantization. UD-Q5_K_XL allocates extra bits to critical layers. |
How to read a quantization name
Q{bits}_K_{size} breaks down as:
- Q = quantized
- {bits} = approximate bits per weight (4, 5, 6, 8)
- K = K-quant method (importance-based mixed precision)
- {size} = S (small), M (medium), L (large), XL (extra-large). Larger = more bits on important layers.
Q5_K_M = 5-bit, K-quant method, medium variant.
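The naming pattern is regular enough to decode mechanically. A small parser sketch follows; the regular expression is my own and covers only the plain and K-quant names used in this guide (plus the optional UD- prefix), not IQ quants or MXFP4.

```python
import re
from typing import Optional

QUANT_RE = re.compile(
    r"^(?P<ud>UD-)?Q(?P<bits>\d)_(?P<scheme>K|0|1)(?:_(?P<size>XL|S|M|L))?$"
)

def parse_quant(name: str) -> Optional[dict]:
    """Split a quantization label into its parts, or return None if unknown."""
    m = QUANT_RE.match(name)
    if m is None:
        return None
    return {
        "ud_prefix": m.group("ud") is not None,   # Unsloth's UD- variants
        "bits": int(m.group("bits")),             # nominal bits per weight
        "k_quant": m.group("scheme") == "K",      # K-quant vs legacy _0/_1
        "size": m.group("size"),                  # S, M, L, XL or None
    }
```

For example, `parse_quant("Q5_K_M")` reports 5 nominal bits, the K-quant scheme, and the medium variant, while an unrelated string such as `"fp16"` returns `None`.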
GGUF vs. MLX: model formats for Apple Silicon
Two ecosystems for running models on Mac: GGUF (llama.cpp) and MLX (Apple's framework). Same goal, different approaches.
| Aspect | GGUF (llama.cpp) | MLX |
|---|---|---|
| Developer | Georgi Gerganov + community | Apple |
| Backend | C/C++ with Metal compute shaders | Python/C++ with Metal, native Apple framework |
| Model format | Single .gguf file with embedded metadata | Directory of .safetensors + JSON config |
| Quantization | K-quants (Q4_K_M, Q5_K_XL, etc.) | Linear quantization (4-bit, 8-bit) |
| Strengths | Mature, battle-tested, huge model library, best raw perf on Apple Silicon, speculative decoding, advanced KV cache | Pythonic API, easy to modify, native Apple integration, better for research and fine-tuning |
| Weaknesses | C++ codebase harder to hack on | Younger ecosystem, fewer pre-quantized models, generally slower inference |
| Best for | Production inference, serving, maximum throughput | Research, prototyping, fine-tuning on Mac |
| Server mode | llama-server (OpenAI-compatible API) | mlx_lm.server (OpenAI-compatible API) |
| Model source | Hugging Face (search "GGUF") | Hugging Face (search "MLX") |
Decision rule: for inference speed and model serving, use llama.cpp with GGUF. For experimentation, fine-tuning, or custom pipelines in Python, use MLX.
Start a llama.cpp server with an OpenAI-compatible API:
# Basic server (auto-detects Apple Silicon Metal backend)
llama-server \
-m model.gguf \
--port 8080
# With all the recommended flags for production use
llama-server -m model.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 99 \
-fa \
--jinja \
-c 32768 \
-ctk q8_0 \
-ctv q8_0
Test the API with curl:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [
{"role": "user", "content": "Hello"}
]
}'
Many power users keep both: llama.cpp for daily model serving, MLX for trying architectures or quick fine-tunes.
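The same request as the curl example can be sent from Python with only the standard library. This sketch assumes a server is listening on localhost:8080 and that the response follows the usual OpenAI chat-completions shape (`choices[0].message.content`); the helper names are my own.

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8080",
                       model: str = "default") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage, with a server running:  print(chat("Hello"))
```

Building the request separately from sending it makes the payload easy to inspect without a running server.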
Fine-tuning
Fine-tuning continues training a pretrained model on a specific dataset to change its behavior.
Methods
| Method | What to know |
|---|---|
| Full fine-tuning | Updates all parameters. Same memory as training from scratch. Best quality, most expensive. |
| LoRA | Low-Rank Adaptation. Freezes original weights, trains small adapter matrices (0.1% to 1% of parameters). Dramatically cheaper. Start here. |
| QLoRA | Loads base model in 4-bit, trains LoRA on top. The original paper fine-tuned a 65B model on a single 48 GB GPU; a 24 GB card handles roughly 30B-class models. |
| DoRA | Weight-Decomposed LoRA. Separates magnitude from direction. Drop-in upgrade over LoRA. |
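The "0.1% to 1%" figure for LoRA falls out of simple arithmetic: a rank-r adapter on a d_in x d_out weight matrix adds two thin matrices with r x (d_in + d_out) parameters in total. The 4096 x 4096 shape and rank 16 below are assumed example values, not taken from a specific model.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter (matrices A and B)."""
    return rank * (d_in + d_out)

def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Adapter size relative to the frozen weight matrix it modifies."""
    return lora_params(d_in, d_out, rank) / (d_in * d_out)

adapter = lora_params(4096, 4096, rank=16)
fraction = lora_fraction(4096, 4096, rank=16)
```

With these assumed values the adapter is under 1% of the size of the matrix it adapts, and doubling the rank doubles the adapter.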
Alignment and preference tuning
After pretraining, models are aligned to be helpful, harmless, and honest:
| Method | What to know |
|---|---|
| SFT | Supervised Fine-Tuning on curated instruction/response pairs. First alignment step. |
| RLHF | Reinforcement Learning from Human Feedback. Trains a reward model, then optimizes against it. Made ChatGPT possible. |
| DPO | Direct Preference Optimization. Trains directly on preference pairs, with no separate reward model or RL loop. Simpler, more stable. |
| ORPO | Odds Ratio Preference Optimization. Combines SFT + preference alignment in one step. |
| GRPO | Group Relative Policy Optimization. Samples multiple responses, uses relative quality. Used by DeepSeek. |
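As one concrete example from the table, the DPO objective for a single preference pair can be written in a few lines. Inputs are summed log-probabilities of the chosen and rejected responses under the model being trained and under a frozen reference model; the argument names and the beta value of 0.1 are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * margin))."""
    # How much more the policy prefers 'chosen' over 'rejected' than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))

untrained = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy identical to reference
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)    # policy now favors the chosen reply
```

The loss starts at log 2 when the policy matches the reference and falls as the policy shifts probability toward the preferred response.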
Model stages
| Stage | What it means |
|---|---|
| Base model | Raw pretrained model. Good at text completion, bad at following instructions. |
| Instruct model | Fine-tuned to follow instructions. What most people mean by "the model." |
| Chat model | Further tuned for multi-turn conversation with a specific chat template. |
| Censored/uncensored | Whether the model refuses certain request categories. |
Distillation
Knowledge distillation trains a smaller "student" model to mimic a larger "teacher" model. The student learns from the teacher's output probabilities, which carry richer information than simple correct/incorrect labels.
| Term | What to know |
|---|---|
| Teacher model | The large, high-quality source. Often a frontier model (GPT-4, Claude). |
| Student model | The smaller model learning to replicate the teacher. |
| Logit distillation | Student matches the teacher's full probability distribution, not just the top prediction. Transfers "soft" knowledge about alternatives the teacher considered. |
| On-policy distillation | Student generates responses, teacher scores them. Often more effective than training only on teacher outputs. |
| Synthetic data distillation | Teacher generates a training dataset, student trains on it with standard SFT. The most common approach. |
When a model card says "distilled from" or "trained on synthetic data from" a larger model, this is what happened. Many Llama and Qwen variants are distilled from larger models in the same family or from proprietary models.
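Logit distillation, the variant that uses the teacher's full distribution, is typically trained with a temperature-softened KL divergence. The sketch below follows the common formulation (including the temperature-squared factor); the temperature of 2.0 and the logits are made-up example values.

```python
import math
from typing import List

def softmax_t(logits: List[float], temperature: float) -> List[float]:
    """Softmax with temperature; higher temperature flattens the distribution."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits: List[float], student_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax_t(teacher_logits, temperature)
    q = softmax_t(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

teacher = [4.0, 2.0, 0.5]
matching = distillation_loss(teacher, [4.0, 2.0, 0.5])
diverging = distillation_loss(teacher, [0.5, 2.0, 4.0])
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, including on the lower-ranked alternatives.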
Chat templates and prompt formatting
LLMs require specific chat templates that structure conversations into roles (system, user, assistant). Wrong template = garbage output, even from a good model.
| Term | What to know |
|---|---|
| Chat template | The exact text format for multi-turn conversation. Defines delimiters for system prompts, user messages, assistant responses. |
| ChatML | Uses <|im_start|> and <|im_end|> tags. Used by Qwen and many community models. |
| Llama format | Llama 3 uses <|begin_of_text|>, <|start_header_id|>, <|end_header_id|>, and <|eot_id|>. Llama 2 used [INST] ... [/INST] instead. |
| Jinja template | Template engine in the tokenizer config that auto-formats messages. Most modern models include one. |
| System prompt | Sets behavior, personality, or constraints at conversation start. Not supported by all models. |
| Special tokens | Tokens like <eos>, <pad>, or template delimiters with special meaning. Not generated as regular text. |
| BOS / EOS | Beginning/End of Sequence tokens. Mark text boundaries. |
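For a sense of what a chat template actually produces, here is a hand-written ChatML formatter. It is a sketch of the basic layout only; a model's real Jinja template may add a default system prompt, tool sections, or thinking tags, so prefer the embedded template when one exists.

```python
from typing import Dict, List

def format_chatml(messages: List[Dict[str, str]],
                  add_generation_prompt: bool = True) -> str:
    """Render role/content messages in basic ChatML layout."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = format_chatml([
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Hi"},
])
```

The model only ever sees this flat string (as tokens), which is why a mismatched template degrades output so badly.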
In practice: always use the --jinja flag in llama.cpp for automatic template detection. Manual template specification is error-prone. Let the model's embedded metadata handle it.
# Auto-detect the chat template from model metadata
llama-server \
-m model.gguf \
--jinja
# Override the system prompt
llama-cli -m model.gguf \
--jinja \
-sys "You are a helpful coding assistant." \
-p "Write a Python function to merge two sorted lists."
Reasoning and thinking models
Reasoning models produce an internal chain of thought before answering. More tokens spent thinking, better answers on hard problems.
| Term | What to know |
|---|---|
| Chain-of-Thought (CoT) | Model shows reasoning step by step before the final answer. Major accuracy gains on math, logic, and coding. |
| Reasoning/thinking tokens | Tokens generated during internal reasoning. Some models (DeepSeek R1, QwQ) show them; others hide them. |
| Thinking budget | Configurable limit on thinking tokens. Higher = better quality, more latency and cost. |
| o1-style reasoning | Model develops reasoning strategies via reinforcement learning during training (not taught explicit patterns). Named after OpenAI's o1. |
| Hybrid thinking | Toggle between fast (no thinking) and slow (extended reasoning) per query. Qwen3 supports /think and /no_think modes. |
When to use thinking: complex coding, math, multi-step reasoning. When to skip it: simple questions, creative writing, straightforward lookups. The thinking overhead is wasted on easy tasks.
Tool calling and agentic use
Tool calling lets the model invoke external functions by generating structured output (usually JSON). The host executes the function and returns results.
| Term | What to know |
|---|---|
| Tool calling / function calling | Model outputs JSON requesting a function call with arguments. Host executes, returns result to model. |
| Tool use loop | Model requests tool → host executes → result fed back → model continues. Can repeat many times per response. |
| MCP (Model Context Protocol) | Anthropic's open standard for connecting LLMs to tools and data sources. Standardized tool, resource, and prompt exposure. |
| Agentic workflow | LLM as autonomous agent: plan, act via tools, observe results, iterate. Multi-step task execution beyond single-turn chat. |
| ReAct | Reasoning + Acting pattern. Model alternates between reasoning about what to do and acting via tool calls. |
| Structured output | Constraining output to a schema (JSON, XML) for reliable machine parsing. Essential for tool calling. |
Check before relying on it: not all models handle tool calling well. It requires specific training. Look for explicit "tool calling" or "function calling" in the model card.
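The tool use loop is easier to follow in code. In this sketch `fake_model` is a hypothetical stand-in that requests one tool call and then answers; the message shapes and the `TOOLS` registry are simplified assumptions, not the exact OpenAI or MCP schema.

```python
import json
from typing import Callable, Dict, List

# Hypothetical tool registry: name -> Python function.
TOOLS: Dict[str, Callable] = {"add": lambda a, b: a + b}

def fake_model(messages: List[dict]) -> dict:
    """Stand-in for an LLM: asks for a tool once, then answers."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"role": "assistant",
                "content": f"The answer is {last['content']}."}
    return {"role": "assistant",
            "tool_call": {"name": "add",
                          "arguments": json.dumps({"a": 2, "b": 3})}}

def run_agent(model: Callable[[List[dict]], dict],
              messages: List[dict], max_steps: int = 5) -> str:
    """Model requests tool -> host executes -> result fed back -> repeat."""
    for _ in range(max_steps):
        reply = model(messages)
        messages.append(reply)
        call = reply.get("tool_call")
        if call is None:                       # plain answer: we are done
            return reply["content"]
        args = json.loads(call["arguments"])   # arguments arrive as JSON text
        result = TOOLS[call["name"]](**args)   # the host, not the model, runs code
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("tool loop did not finish")

history = [{"role": "user", "content": "What is 2 + 3?"}]
answer = run_agent(fake_model, history)
```

The step limit matters in practice: a model that keeps requesting tools would otherwise loop forever.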
RAG (Retrieval-Augmented Generation)
RAG augments the model's prompt with retrieved documents before generation. Reduces hallucination, enables access to private or current data, no fine-tuning required.
| Term | What to know |
|---|---|
| RAG | Retrieve relevant context, inject it into the prompt, then generate. The standard pattern for private/current data. |
| Embedding model | Converts text to dense vectors. Similar texts produce similar vectors. Separate from the generative LLM. |
| Vector database | Stores and searches embeddings by similarity. Qdrant, Chroma, Pinecone, pgvector. |
| Chunking | Splitting documents into pieces for embedding. Chunk size and overlap significantly affect retrieval quality. |
| Semantic search | Finding documents by meaning, not keywords. Uses embedding similarity. |
| Hybrid search | Combines semantic search (embeddings) with keyword search (BM25). Usually outperforms either alone. |
| Reranking | Second-pass model that rescores retrieved documents. Improves precision after initial retrieval. |
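The retrieve-then-inject pattern fits in a short sketch. The `embed` function below is a deliberately crude word-count stand-in for a real embedding model, and the chunks are made up; only the overall flow (embed, score by cosine similarity, put the best chunks into the prompt) reflects how RAG works.

```python
import math
from collections import Counter
from typing import List

def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system uses a neural embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, chunks: List[str], k: int = 1) -> str:
    """Inject the retrieved context ahead of the question."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

chunks = [
    "the kv cache stores keys and values",
    "gguf is a file format for llama.cpp",
    "rope rotates query and key vectors",
]
prompt = build_prompt("what file format does llama.cpp use", chunks)
```

Swapping `embed` for a real embedding model and the list for a vector database gives the standard production setup without changing the flow.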
Multimodal and vision models
VLMs (Vision Language Models) process both text and images. Some handle audio, video, or other modalities.
| Term | What to know |
|---|---|
| VLM | LLM that accepts images alongside text. Describes images, answers questions, reads text in photos. |
| Vision encoder | Processes images into embeddings the language model understands. Usually SigLIP or ViT based. |
| Image tokens | Images convert to token sequences (often hundreds per image). These consume context length like text tokens. |
| OCR capability | Many VLMs read text in images (receipts, screenshots, documents) without a separate OCR system. |
| Examples | Qwen-VL, LLaVA, Gemma with vision, Llama with vision adapters. Most major families now have vision variants. |
For local inference, VLMs work with llama.cpp (--mmproj for vision adapter) and MLX. Expect higher memory usage than text-only models due to the vision encoder.
# Run a vision model (e.g., Gemma or Qwen-VL)
llama-cli -m vision-model.gguf \
--mmproj vision-encoder.gguf \
--jinja \
--image photo.jpg \
-p "What is in this image?"
Licenses
Model licenses determine what you can do with the weights. Check before deploying.
| License | What it allows |
|---|---|
| Apache 2.0 | Permissive. Any use including commercial, provided license and attribution notices are kept. Includes an explicit patent grant. Mistral, some Qwen. |
| MIT | Similar to Apache 2.0 in practice, without the explicit patent grant. Permissive. |
| Llama Community License | Free for most uses. Companies with 700M+ MAU need separate license. Commercial OK for most orgs. |
| Qwen License | Generally permissive for research and commercial. Some restrictions on training competing models. Varies by version. |
| Gemma Terms of Use | Commercial OK with redistribution restrictions. |
| CC-BY-NC | Research and personal only. No commercial use. |
| Research only | Strictly research. No commercial use. |
"Open weights" does not mean "open source." Always check the model card on Hugging Face before commercial deployment.
Inference terminology
Inference = running a trained model to generate output. Know these metrics and concepts.
Speed metrics
| Term | What it measures |
|---|---|
| t/s | Tokens per second. The primary speed measurement. |
| pp (prompt processing) | Speed of processing the input prompt. Also called "prefill." |
| tg (token generation) | Speed of generating new tokens. Much slower than pp because tokens are produced one at a time. |
| pp512, pp65536 | Prompt processing speed at 512 or 65,536 tokens. Measures context scaling. |
| tg128, tg1024 | Token generation speed at 128 or 1,024 output tokens. |
| TTFT | Time To First Token. Delay before generation starts. Driven by prompt processing speed. |
| Throughput | Total t/s across all concurrent users (server metric). |
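These metrics combine into a simple latency estimate for one request: the prompt is processed at pp speed, then output tokens arrive at tg speed. The speeds below are made-up round numbers, not benchmarks.

```python
def ttft_seconds(prompt_tokens: int, pp_tps: float) -> float:
    """Time to first token, assuming prompt processing dominates."""
    return prompt_tokens / pp_tps

def total_seconds(prompt_tokens: int, output_tokens: int,
                  pp_tps: float, tg_tps: float) -> float:
    """Prefill time plus sequential generation time."""
    return ttft_seconds(prompt_tokens, pp_tps) + output_tokens / tg_tps

# Hypothetical: 8,000-token prompt at 1,000 t/s pp, 500 output tokens at 50 t/s tg.
first_token = ttft_seconds(8000, 1000)
whole_reply = total_seconds(8000, 500, 1000, 50)
```

With these numbers the user waits 8 seconds before anything appears and 18 seconds in total, which is why pp speed matters so much for long prompts.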
Speculative decoding and MTP
Standard inference generates one token per forward pass (sequential bottleneck). These techniques draft multiple tokens to break through it.
| Technique | How it works |
|---|---|
| Speculative decoding (draft model) | Small, fast model drafts candidate tokens. Large model verifies them all in one pass. Accepted tokens are free. Typical speedup: 1.5 to 2.5x. |
| MTP (Multi-Token Prediction) | Built-in prediction heads draft 2 to 3 tokens per pass. No separate model needed. Expected speedup: 1.5 to 1.8x. |
| Eagle | Trains a lightweight draft head on top of the target model's hidden states. Faster than a separate draft model. |
| Lookahead decoding | Uses the model's own n-gram patterns to speculate without any draft model. |
| Medusa | Adds parallel prediction heads, each predicting a different future token position. |
Key property: speculative decoding never changes output quality (rejected drafts are discarded). Costs extra memory for the draft model or heads. On memory-rich Apple Silicon, almost always worth enabling.
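A greedy-decoding sketch shows why the output is unchanged: every token that ends up in the result is one the target model would have produced itself. The two toy "models" are hypothetical functions of the last token, and verification is written as a loop for clarity, whereas a real engine checks all drafted positions in a single batched forward pass. Sampling-based variants use a rejection rule instead of exact matching.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]

def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model(tokens))
    return tokens

def speculative_decode(target: Model, draft: Model,
                       prompt: List[int], n: int, k: int = 4) -> List[int]:
    tokens = list(prompt)
    goal = len(prompt) + n
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target verifies them position by position.
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if proposal[i] != expected:
                # Keep the agreed prefix, replace the first miss with the target's token.
                tokens += proposal[:i] + [expected]
                break
        else:
            # All k accepted: the verification pass also yields one extra token.
            tokens += proposal + [target(tokens + proposal)]
    return tokens[:goal]

def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] * 2 + 1) % 7

def toy_draft(tokens: List[int]) -> int:
    # Agrees with the target except after token 3, where it guesses wrong.
    return 5 if tokens[-1] == 3 else toy_target(tokens)
```

Even though the draft is wrong after every 3, the speculative result matches plain greedy decoding with the target token for token.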
# Speculative decoding with a draft model (1.5 to 2.5x speedup)
llama-server -m large-model.gguf \
-md small-draft-model.gguf \
--draft-max 16 \
--draft-min 4 \
-ngl 99 \
-fa \
--jinja
Sampling parameters
The model outputs a probability distribution over its vocabulary. These parameters control token selection:
| Parameter | What it does |
|---|---|
| Temperature | Controls randomness. 0 = greedy (always pick most likely). 1 = proportional sampling. Higher = more creative/chaotic. |
| Top-p (nucleus) | Only consider tokens whose cumulative probability reaches p. 0.9 = smallest set covering 90% probability. |
| Top-k | Only consider the k most likely tokens. 40 = choose from top 40. |
| Repetition penalty | Penalizes already-used tokens. Above 1.0 increases the penalty. |
| Min-p | Only consider tokens with probability ≥ min-p × highest token's probability. Simpler alternative to top-p. |
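Here is how these filters can compose in a single sampling step. The order used (temperature, then top-k, top-p, min-p) is an assumption for illustration; inference engines let you configure the sampler chain and their defaults differ.

```python
import math
import random
from typing import List

def filter_candidates(logits: List[float], temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0,
                      min_p: float = 0.0):
    """Return (token ids that survive the filters, full probability list)."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]              # keep the k most likely tokens
    kept, cumulative = [], 0.0
    for i in order:                        # top-p: smallest set reaching p
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    threshold = min_p * probs[kept[0]]     # min-p: relative to the best token
    kept = [i for i in kept if probs[i] >= threshold]
    return kept, probs

def sample(logits: List[float], temperature: float = 1.0, rng=random, **filters) -> int:
    if temperature == 0:                   # greedy: always the most likely token
        return max(range(len(logits)), key=lambda i: logits[i])
    kept, probs = filter_candidates(logits, temperature, **filters)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

With logits `[2.0, 1.0, 0.0, -1.0]`, a top-p of 0.8 keeps only the first two tokens, and a min-p of 0.5 keeps only the first.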
# Control sampling in llama-cli
llama-cli -m model.gguf \
--jinja \
--temp 0.7 \
--top-p 0.9 \
--top-k 40 \
--min-p 0.05 \
--repeat-penalty 1.1 \
-p "Explain quantum computing in simple terms."
# Greedy decoding (deterministic, best for code/factual tasks)
llama-cli -m model.gguf \
--jinja \
--temp 0 \
-p "Write a function that reverses a linked list."
Memory concepts
| Term | What to know |
|---|---|
| KV cache | Stores the Key and Value vectors of every previous token so they are not recomputed for each new token. Grows linearly with context length. |
| KV cache quantization | Compress the cache (e.g., fp16 → q8_0 or q4_0). Enables longer context. |
| VRAM | GPU memory. The primary bottleneck on NVIDIA hardware. |
| Unified memory | Apple Silicon: CPU and GPU share the same memory pool. Eliminates the VRAM wall. |
| Offloading | Moving model layers to CPU RAM or disk when GPU memory is insufficient. Slower, but runs larger models. |
| ngl (n-gpu-layers) | Number of layers on GPU. Set to 99 to offload everything to GPU. |
# Full GPU offload with KV cache quantization for long context
llama-server -m model.gguf \
-ngl 99 \
-fa \
-c 131072 \
-ctk q8_0 \
-ctv q8_0
# Partial offload when the model doesn't fully fit in memory
llama-server -m huge-model.gguf \
-ngl 20 \
-fa \
-c 8192
Local inference software
| Tool | When to use it |
|---|---|
| llama.cpp | Default choice. C/C++ with Metal (Apple) and CUDA (NVIDIA). GGUF format. Best performance. |
| Ollama | Getting started quickly. User-friendly llama.cpp wrapper. Manages downloads, serves API. |
| vLLM | High-throughput GPU serving. PagedAttention, production-grade. Best supported on NVIDIA; also runs on AMD ROCm and other backends. |
| ExLlamaV2 | Fast NVIDIA inference with EXL2 quantization. Excellent speed at low bit rates. |
| MLX | Apple Silicon native. Python-first. Growing ecosystem. Best for research/fine-tuning on Mac. |
| TGI | Hugging Face's production server. GPTQ, AWQ support. |
| koboldcpp | llama.cpp fork for creative writing/roleplay. Extra samplers and UI features. |
Model naming conventions
Official names
Format: Organization Family-Version-Size-Variant
| Component | Examples | Meaning |
|---|---|---|
| Organization | Qwen, Meta, Google, Mistral | Who made it |
| Family | Llama, Gemma, Qwen, Mistral | Model series |
| Size | 7B, 70B, 405B | Parameter count (billions) |
| Variant | Instruct, Chat, Code, Math | Specialization |
| Version | 3.1, 3.5, 4 | Release version |
Reading examples:
- Llama-3.1-70B-Instruct = Meta, Llama v3.1, 70B params, instruction-tuned
- Qwen3-235B-A22B = Qwen v3, 235B total / 22B active (MoE)
- Gemma-4-27B-IT = Google Gemma v4, 27B params, Instruction Tuned
Community quantization names
Format: ModelName-Quant.gguf
Example: Qwen3-35B-A3B-UD-Q5_K_XL.gguf
Decoded: Qwen3, 35B total / 3B active, Unsloth Dynamic quantization, Q5 K-quant extra-large variant, GGUF format.
The Hugging Face ecosystem
Hugging Face is the central hub for open-weight models. Know these conventions:
| Term | What to know |
|---|---|
| Model card | The README for a model. Architecture, training data, benchmarks, license, usage. Read this before downloading. |
| safetensors | Standard weight file format. Safer and faster than PyTorch .bin. Used by MLX and most frameworks. |
| Model repo | Git repo containing weights, config, tokenizer, metadata. |
| Spaces | Hosted demo apps. Try models before downloading. |
| Transformers library | Hugging Face's Python library. Reference implementation, not speed-optimized. |
| GGUF on HF | Community members (Unsloth, bartowski, others) upload pre-quantized GGUF files. Search "GGUF" in model names. |
Download a GGUF model from Hugging Face:
# Install the CLI if you haven't
pip install huggingface-hub
# Download a specific quantization
huggingface-cli download \
unsloth/Qwen3-30B-A3B-GGUF \
Qwen3-30B-A3B-UD-Q4_K_XL.gguf \
--local-dir ./models
Benchmarks and evaluation
| Benchmark | What it tests |
|---|---|
| MMLU | 57 academic subjects, elementary to professional. Broad knowledge. |
| HumanEval | Code generation from docstrings. Pass@1 success rate. |
| SWE-bench | Fixes real GitHub issues. The gold standard for coding ability. |
| MATH | Competition-level math. Tests reasoning. |
| GSM8K | Grade school multi-step arithmetic. Easier than MATH. |
| ARC | Grade school science requiring reasoning. |
| HellaSwag | Commonsense sentence completion. |
| TruthfulQA | Truthful answers vs. common misconceptions. |
| Perplexity | How "surprised" the model is by text. Lower = better. Raw quality metric, doesn't predict usefulness directly. |
| Chatbot Arena / Elo | Human preference from blind A/B tests, summarized as Elo-style ratings. Most ecologically valid benchmark. |
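Perplexity is the one metric in this table that is easy to compute by hand: it is the exponential of the average negative log-probability the model assigned to each actual token. A minimal sketch, assuming you already have per-token natural-log probabilities from your inference engine:

```python
import math
from typing import List

def perplexity(token_logprobs: List[float]) -> float:
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that spreads probability evenly over 4 choices at every step.
uniform_four = perplexity([math.log(0.25)] * 10)
```

A perplexity of 4 means the model was, on average, as uncertain as if it were choosing among four equally likely tokens.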
Putting it all together
When someone says "I'm running Qwen3 35B-A3B UD-Q5_K_XL on llama.cpp with Flash Attention and q8_0 KV cache at 128k context," decode it:
- Qwen3: Qwen family, version 3
- 35B-A3B: 35 billion total, 3 billion active per token (MoE with GQA)
- UD-Q5_K_XL: Unsloth Dynamic quantization, ~5.5 bits per weight, extra-large variant, GGUF format
- llama.cpp: inference engine, C/C++ with Metal backend on Apple Silicon
- Flash Attention: tiled attention for memory efficiency, enabled with -fa
- q8_0 KV cache: attention cache compressed to 8-bit, roughly halving its memory footprint
- 128k context: up to 128,000 tokens per session, using RoPE for position encoding
That setup: ~25 GB in memory, ~99 t/s on an M5 Max, entire codebase fits in context. Once you know the terminology, you can evaluate any model configuration at a glance.
The full command:
llama-server \
-m Qwen3-35B-A3B-UD-Q5_K_XL.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 99 \
-fa \
--jinja \
-c 131072 \
-ctk q8_0 \
-ctv q8_0
Terminology current as of mid-2026. The field moves fast. When in doubt, check the model card on Hugging Face and the docs for your inference engine.