Performance Tuning
Optimize llama.cpp inference performance across CPU, GPU, and hybrid configurations.

Quick Wins

- Use GPU: offload layers to the GPU with `--n-gpu-layers`
- Optimize threads: set `--threads` to the number of physical CPU cores
- Choose quantization: Q4_K_M or Q5_K_M offer the best speed/quality tradeoff
- Adjust context: reduce `--ctx-size` to the minimum needed

GPU Acceleration
CUDA (NVIDIA)
Offload layers to the GPU:
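For example (model path and layer count are illustrative):

```bash
# Offload 32 transformer layers; use a large value such as 99 to offload all of them
./llama-cli -m models/model.gguf --n-gpu-layers 32
```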
Metal (Apple Silicon)

Metal is enabled by default on macOS:
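No extra flags are needed to use Metal; a typical run still sets `--n-gpu-layers` (model path illustrative):

```bash
# Metal is picked up automatically in Apple Silicon builds
./llama-cli -m models/model.gguf -ngl 99
```

ROCm (AMD)

Build with the ROCm backend, then offload layers as with CUDA. A sketch, assuming a recent build where the CMake option is `GGML_HIP` (the GPU target is illustrative):

```bash
# Configure and build with ROCm/HIP support
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030
cmake --build build --config Release

# Offload layers exactly as with CUDA
./build/bin/llama-cli -m models/model.gguf -ngl 99
```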
Thread Configuration
Finding Optimal Thread Count
Start conservatively:

- CPU-only: one thread per physical CPU core (not logical/hyperthreaded cores)
- With GPU offload: 4-8 threads regardless of core count
- Server (parallel requests): 2-4 threads per request
Check physical core count:
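One way to check on each platform (the Windows command is PowerShell):

```bash
# Linux: physical cores = Core(s) per socket × Socket(s)
lscpu | grep -E 'Socket|Core'

# macOS
sysctl -n hw.physicalcpu

# Windows (PowerShell):
# (Get-CimInstance Win32_Processor).NumberOfCores
```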
Batch Thread Configuration
Set a separate thread count for prompt processing with `--threads-batch`:
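A sketch (model path illustrative); prompt processing is compute-bound, so it often benefits from more threads than token generation:

```bash
# 8 threads for generation, 16 for prompt (batch) processing
./llama-cli -m models/model.gguf --threads 8 --threads-batch 16
```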
Context Size Optimization

Context size directly impacts:

- Memory usage (RAM/VRAM)
- Inference speed
- Maximum conversation length
Only use large context (>4096) when absolutely necessary. Most tasks work well with 2048.
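For example (model path illustrative):

```bash
# Cap the context at 2048 tokens (also available as -c)
./llama-cli -m models/model.gguf --ctx-size 2048
```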
Batch Size Tuning
Logical batch size (`--batch-size`) controls prompt-processing parallelism:

- Larger batches = faster prompt processing but more memory
- CPU: 512-2048
- GPU: 512-2048 (depends on VRAM)
- Server: 2048+ for parallel requests
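For example (sizes illustrative; `--ubatch-size` sets the physical micro-batch actually submitted to the backend):

```bash
# Accept up to 2048 prompt tokens per logical batch, processed in 512-token micro-batches
./llama-cli -m models/model.gguf --batch-size 2048 --ubatch-size 512
```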
Flash Attention
Flash Attention computes attention more efficiently, reducing memory use and often improving speed. It is enabled by default (auto) when beneficial; enable it explicitly with `--flash-attn on`.
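For example (model path illustrative):

```bash
# Force flash attention on instead of leaving it on auto
./llama-cli -m models/model.gguf -ngl 99 --flash-attn on
```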
Quantization Selection

| Quantization | Speed | Quality | Use Case |
|---|---|---|---|
| Q2_K | Fastest | Lowest | Experimentation |
| Q3_K_M | Very Fast | Low | Resource-constrained |
| Q4_K_M | Fast | Good | Recommended default |
| Q5_K_M | Moderate | Very Good | Quality-focused |
| Q6_K | Slower | Excellent | Near-original quality |
| Q8_0 | Slowest | Highest | Reference/evaluation |
Benchmark Example
Real-world benchmark on an NVIDIA A6000 (48 GB VRAM) with a 7-core CPU and a 30B Q4_0 model:

| Configuration | Tokens/sec |
|---|---|
| GPU only, wrong thread count | <0.1 |
| CPU only (-t 7) | 1.7 |
| GPU + 1 thread | 5.5 |
| GPU + 7 threads | 8.7 |
| GPU + 4 threads | 9.1 |
Hybrid CPU+GPU Inference
For models larger than available VRAM, offload only the layers that fit (see the sketch below), e.g.:

- 40 layers on the GPU
- Remaining layers on the CPU
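A sketch; the right layer count depends on the model and on free VRAM:

```bash
# Offload 40 layers; everything else runs on the CPU
./llama-cli -m models/model.gguf --n-gpu-layers 40
```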
Memory Optimization
Memory Mapping
mmap is enabled by default and recommended: model weights are paged in from disk on demand instead of being copied into RAM up front. Disable it only if necessary with `--no-mmap`.

Memory Locking

Lock the model in RAM to prevent swapping (requires sufficient free RAM):
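For example (model path illustrative):

```bash
# Lock model pages in RAM so the OS cannot swap them out
./llama-cli -m models/model.gguf --mlock
```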
Server Performance

Parallel Request Handling
- `--parallel` (`-np`): number of simultaneous request slots (2-8)
- `--threads`: threads per request (2-4 recommended)
- `--ctx-size`: total context shared across all slots (per-request context × slot count)
- `--batch-size`: keep large (2048+) so parallel prompts are processed efficiently
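A sketch of a parallel-serving launch (model path and sizes are illustrative):

```bash
# 4 slots, each with an effective 4096-token context (4 × 4096 = 16384 total)
./llama-server -m models/model.gguf -c 16384 -np 4 --threads 8 -ngl 99
```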
Continuous Batching
Continuous batching is enabled by default and improves throughput by scheduling incoming requests into free slots without waiting for running requests to finish.

Platform-Specific Tips
Starting configurations for each platform (NVIDIA GPU, Apple Silicon, AMD GPU with ROCm, CPU-only) and a multi-GPU split are sketched below.
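Illustrative commands; model paths, thread counts, and split ratios are assumptions to adapt to your hardware:

```bash
# NVIDIA GPU: offload everything, flash attention on
./llama-cli -m models/model.gguf -ngl 99 --flash-attn on -t 4

# Apple Silicon: Metal is used automatically
./llama-cli -m models/model.gguf -ngl 99 -t 8

# AMD GPU (ROCm build): same flags as CUDA
./llama-cli -m models/model.gguf -ngl 99 -t 4

# CPU-only: threads = physical cores, modest batch size
./llama-cli -m models/model.gguf -t 8 -b 512

# Multi-GPU: split layers across two devices in a 3:1 ratio
./llama-cli -m models/model.gguf -ngl 99 --split-mode layer --tensor-split 3,1
```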
Profiling and Monitoring
Built-in Performance Stats
Timing statistics are printed at the end of each run:

- Prompt evaluation time
- Token generation time
- Tokens per second
Server Metrics
Query the server's metrics endpoint for:

- Request counts
- Processing times
- KV cache usage
- Queue statistics
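For example (port is illustrative; `--metrics` enables the Prometheus-compatible endpoint):

```bash
# Start the server with metrics enabled
./llama-server -m models/model.gguf --port 8080 --metrics

# Query the endpoint
curl http://localhost:8080/metrics
```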
Benchmark Tool
Systematic performance testing with llama-bench:
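For example, sweeping thread counts and offload levels (model path illustrative; `-t` and `-ngl` accept comma-separated lists):

```bash
# Benchmark 512-token prompts and 128-token generations across 4 configurations
./llama-bench -m models/model.gguf -t 4,8 -ngl 0,99 -p 512 -n 128
```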
Common Performance Issues

Very slow generation (<1 tok/s)
Likely causes:
- Too many threads (oversaturation)
- No GPU acceleration
- Context size too large
Fixes:

- Set `--threads 1` and gradually increase
- Enable GPU layers: `--n-gpu-layers 32`
- Reduce context: `--ctx-size 2048`
Out of memory errors
Solutions:
- Use smaller quantization (Q4_K_M instead of Q8_0)
- Reduce context size: `--ctx-size 1024`
- Reduce batch size: `--batch-size 256`
- Offload fewer layers: `--n-gpu-layers 20`
- Keep mmap enabled (the default) rather than disabling it with `--no-mmap`
GPU underutilized
Check:
- Are layers offloaded? (check startup logs)
- Is batch size large enough? Try 512 or 1024
- Are you using optimal quantization? (Q4_K_M recommended)
Server slow with multiple requests
Solutions:
- Increase parallel slots: `--parallel 8`
- Ensure `--ctx-size` covers all slots (per-request context × slot count)
- Reduce per-request threads: `--threads 2`
- Keep continuous batching enabled (`--cont-batching` is the default)
Advanced Optimizations
CPU Affinity
Bind threads to specific cores:
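A sketch, assuming a recent build that provides the `--cpu-range` and `--cpu-mask` affinity flags:

```bash
# Pin the 8 worker threads to cores 0-7
./llama-cli -m models/model.gguf -t 8 --cpu-range 0-7
```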
Process Priority

Raise process priority with `--prio`: -1 (low), 0 (normal), 1 (medium), 2 (high), 3 (realtime).
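For example (model path illustrative; realtime priority may require elevated privileges):

```bash
# Run generation at high priority
./llama-cli -m models/model.gguf --prio 2
```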
Polling Level
Reduce latency by busy-waiting for work with `--poll`:
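For example (model path illustrative; the poll level ranges from 0, no polling, to 100):

```bash
# Busy-wait for new work instead of sleeping between tasks
./llama-cli -m models/model.gguf -t 8 --poll 100
```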
Next Steps

Quantization Guide
Learn about quantization types and tradeoffs
Backend Configuration
Configure GPU backends for your hardware
Benchmarking
Measure and compare performance
Server Tuning
Optimize server for production

