Macs are popular with (non-ML) developers, and the combination of (potentially) large amounts of unified GPU memory and decent memory bandwidth are appealing. llama.cpp started as a project to run inference of LLaMA models on Apple Silicon (CPUs).
For non-technical users, there are several “1-click” methods that leverage llama.cpp:
- Nomic’s GPT4All - a Mac/Windows/Linux installer, model downloader, has a GUI, CLI, and API bindings
- Ollama - a brand new project with a slightly nicer chat window
NOTE: One important note is that while it’s possible to use Macs for inference, if you’re tempted to buy one primarily to use for LLMs (eg, a Mac Studio with 192GiB of RAM will cost about the same as a 48GB Nvidia A6000 Ada so seems like a good deal), be aware that Macs have some severe issues/limitations atm:
- When context becomes full, llama.cpp currently suffers huge slowdowns that manifest as multi-second pauses (computation falls back to CPU). If your goal is simply to run inference (chat with) the largest public models, you will get much better performance with say, 2 x 24GB RTX 3090s (~4000).
- If you are planning on using Apple Silicon for ML/training, I’d also be wary. There are multi-year long open bugs in PyTorch, and most major LLM libs like bitsandbytes have no Apple Silicon support
llama.cpp is a breeze to get running without any additional dependencies:
git clone https://github.com/ggerganov/llama.cpp
# where 8 is your threads for faster compiles
make clean && make LLAMA_METAL=1 -j8
Grab any Llama compatible GGML you want to try (you can start here). I recommend q4_K_M as the sweet spot for quantize if you don’t know which one to get.
You can run a simple benchmark to check for output and performance (most LLaMA 1 models should be
./main -m ~/models/llama-2-7b-chat.ggmlv3.q4_K_M.bin -ngl 1 -c 4096 -n 200 --ignore-eos
You can then run the built in web server and be off chatting at http://localhost:8080/:
./server -c 4096 -ngl 1 -m ~/models/llama-2-7b-chat.ggmlv3.q4_K_M.bin
If you are benchmarking vs other inference engines, I recommend using these standard settings:
./main -m <model> -ngl 1 -n 2048 --ignore-eos
- Metal uses
-ngl 1(or any really) since it’s unified memory, but for CUDA systems you’d want something like
-ngl 99to get all layers in memory
- Default prompt context is 512 - this is probably fine to leave as is? Most testing I’ve seen online doesn’t change this
-nshould be the max context you want to test to and
--ignore-eosis required so it doesn’t end prematurely (as context gets longer, speed tends to slow down Here is a discussion that tracks the performance of various Apple Silicon chips:
MLC LLM is an implementation that runs not just on Windows, Linux, and Mac, but also iOS, Android, and even in web browsers w/ WebGPU support. Assuming you have
conda setup already, the instructions for installing are up to date and work without hitches.
Currently, the performance is about 50% slower than llama.cpp on my M2 MBA.
- A simple guide to local LLM fine-tuning on a Mac with MLX