Running non-quantized Hugging Face LLMs locally in Ollama.
Example: converting Hermes-4-14B into a quantized GGUF model (hermes4-14b.Q4_K_M.gguf).
This guide uses Visual Studio Code (VS Code) v1.105.1 and macOS Tahoe 26.0.1.
Log in at huggingface.co and create an access token.
Start the macOS Terminal.
Replace hf_your_token_here with your newly created Hugging Face token:
~/> export HUGGING_FACE_HUB_TOKEN=hf_your_token_here
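A quick check that the variable is set in the current shell. Note that newer versions of huggingface_hub primarily read HF_TOKEN, so exporting that name as well is a harmless safeguard:
~/> echo $HUGGING_FACE_HUB_TOKEN
~/> export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN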
~/> git clone https://github.com/ggerganov/llama.cpp
~/> cd llama.cpp
Open the directory in VS Code with
~/llama.cpp> code .
In VS Code, click on any .py file.
Click the Python version shown in the lower right status bar (e.g. Python 3.13.7).
Click + Create Virtual Environment in the popup list.
Select Venv (last in the list).
Select Custom in the list.
Select the Python version you want the virtual environment to use, e.g. 3.11.13
(don't pick the very latest version; many Python packages are not yet compatible with it).
Enter .venv as the environment name and press Enter.
Select Skip package installation.
Done. You now have a virtual Python environment.
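If you prefer the terminal over the GUI, the same environment can be created directly (assuming a suitable Python such as 3.11 is installed, e.g. via Homebrew):
~/llama.cpp> python3.11 -m venv .venv
~/llama.cpp> source .venv/bin/activate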
In VS Code, select New Terminal from the Terminal menu.
You will now see (.venv) at the start of the terminal prompt.
This means that packages installed with e.g. pip install go only into the .venv environment and do not affect other Python environments on the machine.
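A quick sanity check that the terminal really uses the virtual environment; the printed path should point into .venv (with <you> standing in for your user name):
(.venv) ~/llama.cpp> which python3
/Users/<you>/llama.cpp/.venv/bin/python3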
In the VS Code terminal:
(.venv) ~/llama.cpp> python3 -m pip install transformers sentencepiece protobuf safetensors huggingface_hub mistral-common torch
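Before building, you can optionally confirm that the heavier packages import cleanly:
(.venv) ~/llama.cpp> python3 -c "import torch, transformers; print(torch.__version__, transformers.__version__)"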
(.venv) ~/llama.cpp> mkdir build
(.venv) ~/llama.cpp> cd build
(.venv) ~/llama.cpp/build> brew install cmake
(.venv) ~/llama.cpp/build> cmake ..
(.venv) ~/llama.cpp/build> cmake --build . --config Release
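The build places its tools under build/bin; a quick check that the quantization binary used in a later step actually got built:
(.venv) ~/llama.cpp/build> ls bin/llama-quantize
bin/llama-quantize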
Download Hermes-4-14B from Hugging Face and create the quantized version:
(.venv) ~/llama.cpp/build> cd ../../
(.venv) ~/> brew install git-lfs
(.venv) ~/> git lfs install
(.venv) ~/> git clone https://huggingface.co/NousResearch/Hermes-4-14B
(.venv) ~/> cd Hermes-4-14B
(.venv) ~/Hermes-4-14B> git lfs pull
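Assuming the weights ship as .safetensors shards (as most current Hugging Face models do), verify that they were really downloaded and are not just unresolved LFS pointer stubs; pointers are ~130 bytes of text, real shards several GB each:
(.venv) ~/Hermes-4-14B> ls -lh *.safetensors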
(.venv) ~/Hermes-4-14B> cd ../llama.cpp/
(.venv) ~/llama.cpp> python3 convert_hf_to_gguf.py ../Hermes-4-14B --outfile hermes4-14b-fp16.gguf
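fp16 stores about 2 bytes per parameter, so for a 14B model the resulting GGUF should be on the order of 28 GB:
(.venv) ~/llama.cpp> ls -lh hermes4-14b-fp16.gguf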
(.venv) ~/llama.cpp> ./build/bin/llama-quantize hermes4-14b-fp16.gguf hermes4-14b.Q4_K_M.gguf Q4_K_M
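Q4_K_M averages a bit under 5 bits per weight, so the quantized file should come out at roughly a third of the fp16 size:
(.venv) ~/llama.cpp> ls -lh hermes4-14b*.gguf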
Create a Modelfile with the following content:
(.venv) ~/llama.cpp> vi Modelfile
FROM ./hermes4-14b.Q4_K_M.gguf
TEMPLATE """
<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
(.venv) ~/llama.cpp> ollama create hermes4local -f Modelfile
(.venv) ~/llama.cpp> ollama run hermes4local
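ollama run also accepts a one-shot prompt as an argument, which makes for a quick smoke test:
(.venv) ~/llama.cpp> ollama run hermes4local "Explain in one sentence what Q4_K_M quantization does."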