llama.cpp requires models to be in GGUF format. If you have a model in PyTorch, SafeTensors, or another format, you’ll need to convert it first.
Overview
The conversion process transforms model weights and metadata from Hugging Face format (or other formats) into the GGUF format used by llama.cpp.

When to convert:
- You have a model in PyTorch (.bin, .pt) or SafeTensors (.safetensors) format
- You want to use a model from Hugging Face that isn’t available in GGUF
- You’ve fine-tuned a model and need to convert it for inference
When you don’t need to convert:
- The model is already available in GGUF format on Hugging Face
- You can use a pre-converted version
Quick Start
The main conversion script is convert_hf_to_gguf.py:
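A typical invocation looks like the following sketch; the model path and output filename are illustrative:

```bash
python convert_hf_to_gguf.py /path/to/model --outfile model-f16.gguf --outtype f16
```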
Step-by-Step Conversion Process
Obtain the Model
First, download the model in its original format from Hugging Face or another source.

You should see files like:
- config.json
- tokenizer.json / tokenizer.model
- model-*.safetensors or pytorch_model-*.bin
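For example, one way to download (a sketch, assuming huggingface-cli is installed; the repository name is a placeholder):

```bash
huggingface-cli download <org>/<model-name> --local-dir ./my-model
```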
Install Dependencies
Install the required Python packages.

Key dependencies:
- torch - PyTorch, for loading model weights
- transformers - Hugging Face Transformers library
- numpy - numerical operations
- gguf - GGUF format library
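One way to install them, from the llama.cpp repository root:

```bash
pip install -r requirements.txt
```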
Run Conversion
Convert the model to GGUF format (an example command follows the list below). The script will:
- Load the model configuration
- Read model weights
- Convert tensors to GGUF format
- Save the output file
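A sketch, with illustrative paths:

```bash
python convert_hf_to_gguf.py ./my-model --outfile my-model-f16.gguf --outtype f16
```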
Conversion Script Reference
convert_hf_to_gguf.py
The primary conversion script for Hugging Face models.

Common Options
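A few frequently used flags, as a non-exhaustive sketch (run python convert_hf_to_gguf.py --help for the authoritative list in your version):

- --outfile FNAME - path of the output GGUF file
- --outtype TYPE - output tensor type (f32, f16, bf16, …)
- --vocab-only - write only the vocabulary, without tensor data
- --metadata FILE - override metadata (e.g. name, author) from a JSON file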
Output Types
- f16 (default): 16-bit floating point - good balance of size and quality
- f32: 32-bit floating point - full precision, largest file
- bf16: BFloat16 - alternative 16-bit format, same size as f16
Other Conversion Scripts
convert_lora_to_gguf.py
Convert LoRA (Low-Rank Adaptation) adapters to GGUF format. Useful for models fine-tuned with the LoRA technique. See the GGUF-my-LoRA space for online conversion.
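A sketch of the invocation; the --base flag and paths are assumptions, so check --help for your version:

```bash
python convert_lora_to_gguf.py /path/to/lora-adapter --base /path/to/base-model
```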
convert_llama_ggml_to_gguf.py
Convert the old GGML format to the current GGUF format. Only needed for very old llama.cpp models from before the GGUF format was introduced.
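A sketch; the flag names are assumptions, so verify with --help:

```bash
python convert_llama_ggml_to_gguf.py --input old-model.ggml --output model.gguf
```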
Supported Model Architectures
The conversion script automatically detects the model architecture from config.json. Supported architectures include LLaMA, Mistral, Mixtral, Qwen, Phi, Gemma, and many others; see Supported Models for the full list.
Advanced Conversion
Converting from ModelScope
Models from ModelScope can be converted the same way once the model files are downloaded locally.
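For example, as a sketch (the modelscope CLI flags and repository name are assumptions):

```bash
# Download from ModelScope, then run the same conversion script
modelscope download --model <org>/<model-name> --local_dir ./my-model
python convert_hf_to_gguf.py ./my-model --outfile my-model-f16.gguf --outtype f16
```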
Vocabulary-Only Conversion

For testing tokenizers, or when you only need the vocabulary:
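The --vocab-only flag skips tensor data; the output filename here is illustrative:

```bash
python convert_hf_to_gguf.py /path/to/model --vocab-only --outfile vocab.gguf
```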
Custom Metadata

Embed custom metadata during conversion; the embedded metadata can be inspected later with llama-cli --model-info.
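A sketch using the --metadata override flag; the contents of the JSON file are an assumption for illustration:

```bash
# metadata.json contains overrides such as the model name and author
python convert_hf_to_gguf.py /path/to/model --metadata metadata.json --outfile model-f16.gguf
```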
Online Conversion Tools
If you prefer not to set up a local environment, use these Hugging Face Spaces:

GGUF-my-repo
GGUF-my-repo - the official converter and quantizer.

Features:
- Convert any Hugging Face model to GGUF
- Automatically quantize to multiple formats
- No local setup required
- Results published to your Hugging Face account
To use it:
1. Visit the space
2. Enter the model repository name
3. Select quantization options
4. Click “Submit”
5. Download the resulting GGUF files
The space is synced from llama.cpp main branch every 6 hours, so it uses recent conversion code.
GGUF-my-LoRA

GGUF-my-LoRA - convert LoRA adapters.

A specialized tool for converting LoRA fine-tuned models. See the discussion for details.
Troubleshooting
ModuleNotFoundError: No module named 'torch'

Solution: install the Python requirements:
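For example, from the llama.cpp repository root:

```bash
pip install -r requirements.txt
```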
Model architecture not recognized

Symptoms: the conversion script exits with an error saying the model architecture is unknown or unsupported.

Solutions:
- Check if your model architecture is supported in Supported Models
- Update llama.cpp to the latest version
- If it’s a new architecture, it may not be supported yet
Out of memory during conversion
Solution:
The conversion process loads the entire model into memory. For large models (70B+):
- Use a machine with sufficient RAM (at least 2x the model size)
- Close other applications
- Consider using the GGUF-my-repo online tool instead
Conversion is very slow
This is normal for large models. Expected times:
- 7B model: 2-5 minutes
- 13B model: 5-10 minutes
- 70B model: 30-60 minutes
TypeError or tensor shape errors

Solution: ensure you have the latest version of llama.cpp. Model formats change, and older conversion scripts may not work with newer models.
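A sketch, assuming you are working from a git clone of the repository:

```bash
git pull
pip install -r requirements.txt --upgrade
```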
After Conversion
Once you have a GGUF file, you can:

- Use it directly if the F16 size is acceptable (see the example below)
- Quantize it to reduce size (recommended) - see Quantizing Models for details
- Share it on Hugging Face for others to use
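A sketch of the first two options; binary locations depend on how you built llama.cpp:

```bash
# Run the F16 model directly
./llama-cli -m model-f16.gguf -p "Hello"

# Or quantize first to shrink the file (Q4_K_M is a common choice)
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```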
Example: Complete Workflow
Here’s a complete example converting and using a model:
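The following is a sketch; the repository name, paths, and binary locations are illustrative:

```bash
# 1. Download the original model (assuming huggingface-cli is available)
huggingface-cli download <org>/<model-name> --local-dir ./my-model

# 2. Convert to GGUF at F16 precision
python convert_hf_to_gguf.py ./my-model --outfile my-model-f16.gguf --outtype f16

# 3. Quantize to reduce size (optional but recommended)
./llama-quantize my-model-f16.gguf my-model-q4_k_m.gguf Q4_K_M

# 4. Run inference on the quantized model
./llama-cli -m my-model-q4_k_m.gguf -p "Hello, how are you?"
```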
Next Steps

- Learn about Quantizing Models to reduce model size
- See Supported Models for architecture compatibility
- Read about Obtaining Models to find pre-converted GGUF files

