Macs are popular with (non-ML) developers, and the combination of (potentially) large amounts of unified GPU memory and decent memory bandwidth is appealing. llama.cpp started as a project to run inference of LLaMA models on Apple Silicon (CPUs).

For non-technical users, there are several “1-click” methods that leverage llama.cpp:

Nomic’s GPT4All - a Mac/Windows/Linux installer and model downloader with a GUI, CLI, and API bindings
Ollama - a brand new project with a slightly nicer chat window

NOTE: While it’s possible to use Macs for inference, if you’re tempted to buy one primarily to use for LLMs (e.g., a Mac Studio with 192GiB of RAM will cost about the same as a 48GB Nvidia A6000 Ada, so it seems like a good deal), be aware that Macs have some severe issues/limitations at the moment:

When context becomes full, llama.cpp currently suffers huge slowdowns that manifest as multi-second pauses (computation falls back to CPU). If your goal is simply to run inference on (chat with) the largest public models, you will get much better performance with, say, 2 x 24GB RTX 3090s (~$1500 used) or a single RTX A6000 48GB ($4000).
If you are planning on using Apple Silicon for ML/training, I’d also be wary. There are multi-year-long open bugs in PyTorch, and most major LLM libs like bitsandbytes have no Apple Silicon support.

Inference

llama.cpp

llama.cpp is a breeze to get running without any additional dependencies:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# -j8 runs 8 parallel build jobs for faster compiles; adjust to your core count
make clean && make LLAMA_METAL=1 -j8
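If you would rather not hard-code the job count, a small helper can look it up. This is a minimal sketch, not part of llama.cpp: the function name ncpu is made up for illustration, and it assumes either macOS's sysctl or the getconf utility is available.

```shell
# Hypothetical helper: print the number of logical CPUs.
# Tries the macOS sysctl key first, then falls back to getconf, then to 1.
ncpu() {
  n="$(sysctl -n hw.ncpu 2>/dev/null)"
  [ -n "$n" ] || n="$(getconf _NPROCESSORS_ONLN 2>/dev/null)"
  echo "${n:-1}"
}

# Usage (same build command as above, with the job count filled in):
#   make clean && make LLAMA_METAL=1 -j"$(ncpu)"
```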

Grab any Llama-compatible GGML model you want to try (you can start here). I recommend q4_K_M as the sweet spot for quantization if you don’t know which one to get.

You can run a simple benchmark to check for output and performance (most LLaMA 1 models should be -c 2048):

./main -m ~/models/llama-2-7b-chat.ggmlv3.q4_K_M.bin -ngl 1 -c 4096 -n 200 --ignore-eos

You can then run the built in web server and be off chatting at http://localhost:8080/:

./server -c 4096 -ngl 1 -m ~/models/llama-2-7b-chat.ggmlv3.q4_K_M.bin
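Besides the chat page, the server can also be queried from the command line. The sketch below assumes the server's /completion endpoint accepts a JSON body with prompt and n_predict fields, which is how I understand the llama.cpp server example to work; check the README for your checkout. The helper name completion_body is made up, and it does not escape quotes or backslashes in the prompt.

```shell
# Hypothetical helper: build a JSON request body for the /completion endpoint.
# $1 = prompt text (must not contain double quotes or backslashes)
# $2 = number of tokens to generate
completion_body() {
  printf '{"prompt": "%s", "n_predict": %d}' "$1" "$2"
}

# Usage (requires the server above to be running on port 8080):
#   curl -s http://localhost:8080/completion \
#     -H 'Content-Type: application/json' \
#     -d "$(completion_body 'Hello' 64)"
```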

If you are benchmarking vs other inference engines, I recommend using these standard settings:

./main -m <model> -ngl 1 -n 2048 --ignore-eos

Metal uses -ngl 1 (or any value, really) since it’s unified memory, but for CUDA systems you’d want something like -ngl 99 to get all layers in memory.
Default prompt context is 512 - this is probably fine to leave as is? Most testing I’ve seen online doesn’t change this.
-n should be the max context you want to test to, and --ignore-eos is required so it doesn’t end prematurely (as context gets longer, speed tends to slow down).

Here is a discussion that tracks the performance of various Apple Silicon chips:

https://github.com/ggerganov/llama.cpp/discussions/4167
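To keep runs comparable across models, the standard benchmark settings recommended above can be wrapped in a small function. This is only a sketch: the function name bench_cmd is made up, it prints the command rather than running it, and it does not quote model paths that contain spaces.

```shell
# Hypothetical helper: print (not run) the standard benchmark command.
# $1 = path to the model file
# $2 = -ngl value; defaults to 1 (the Metal setting), use e.g. 99 on CUDA
bench_cmd() {
  model="$1"
  ngl="${2:-1}"
  printf './main -m %s -ngl %s -n 2048 --ignore-eos\n' "$model" "$ngl"
}

# Usage: review the printed command, then pipe it to sh to run it, e.g.
#   bench_cmd ~/models/llama-2-7b-chat.ggmlv3.q4_K_M.bin | sh
```

Printing the command first makes it easy to confirm that every model is tested with identical flags before committing to a long run.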

MLC LLM

MLC LLM is an implementation that runs not just on Windows, Linux, and Mac, but also on iOS, Android, and even in web browsers with WebGPU support. Assuming you have conda set up already, the installation instructions are up to date and work without a hitch.

Currently, the performance is about 50% slower than llama.cpp on my M2 MBA.

Fine Tuning

MLX

A simple guide to local LLM fine-tuning on a Mac with MLX
- https://www.reddit.com/r/LocalLLaMA/comments/191s7x3/a_simple_guide_to_local_llm_finetuning_on_a_mac/
- https://apeatling.com/articles/simple-guide-to-local-llm-fine-tuning-on-a-mac-with-mlx/
