Apple Silicon Macs

For non-technical users, there are several "1-click" methods that leverage llama.cpp:

  • Nomic's GPT4All - an installer for Mac/Windows/Linux with a model downloader, GUI, CLI, and API bindings
  • Ollama - a brand new project with a slightly nicer chat window

NOTE: While it's possible to use Macs for inference, if you're tempted to buy one primarily for LLMs (e.g., a Mac Studio with 192GB of RAM costs about the same as a 48GB Nvidia A6000 Ada, so it seems like a good deal), be aware that Macs currently have some severe issues/limitations.

llama.cpp

llama.cpp is a breeze to get running without any additional dependencies:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# -j8 runs 8 parallel jobs for faster compiles; set it to your CPU core count
make clean && make LLAMA_METAL=1 -j8

Grab any Llama-compatible GGML model you want to try (you can start here). I recommend q4_K_M as the sweet-spot quantization if you don't know which one to get.
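
If you'd rather quantize a model yourself instead of downloading a pre-quantized GGML, the flow (a sketch based on the llama.cpp tools of this era; the model path is a placeholder) looks roughly like:

# convert the original Hugging Face weights to an f16 GGML file
python3 convert.py ~/models/llama-2-7b-chat/
# quantize the f16 file down to q4_K_M
./quantize ~/models/llama-2-7b-chat/ggml-model-f16.bin ~/models/llama-2-7b-chat/ggml-model-q4_K_M.bin q4_K_M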

You can run a simple benchmark to check for output and performance (most LLaMA 1 models should use -c 2048):

./main -m ~/models/llama-2-7b-chat.ggmlv3.q4_K_M.bin -ngl 1 -c 4096 -n 200 --ignore-eos

You can then run the built-in web server and be off chatting at http://localhost:8080/:

./server -c 4096 -ngl 1 -m ~/models/llama-2-7b-chat.ggmlv3.q4_K_M.bin
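
The server also exposes a JSON completion API; a minimal request (a sketch; the /completion endpoint and fields follow the llama.cpp server docs of this era) looks like:

curl --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 128}'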

If you are benchmarking vs other inference engines, I recommend using these standard settings:

./main -m <model> -ngl 1 -n 2048 --ignore-eos
  • Metal uses -ngl 1 (any non-zero value works since memory is unified), but on CUDA systems you'd want something like -ngl 99 to get all layers into VRAM
  • The default prompt context is 512; most testing I've seen online leaves this as is
  • -n should be the max context length you want to test to, and --ignore-eos is required so generation doesn't end prematurely (as context gets longer, speed tends to slow down)
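
To compare several models with the same settings, a simple wrapper loop (a sketch; the model paths are placeholders) could look like:

# run the standard benchmark flags across several models
for MODEL in ~/models/llama-2-7b-chat.ggmlv3.q4_K_M.bin ~/models/llama-2-13b-chat.ggmlv3.q4_K_M.bin; do
  echo "=== $MODEL ==="
  ./main -m "$MODEL" -ngl 1 -n 2048 --ignore-eos
done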

MLC LLM

MLC LLM is an implementation that runs not just on Windows, Linux, and Mac, but also on iOS, Android, and even in web browsers with WebGPU support. Assuming you have conda set up already, the installation instructions are up to date and work without hitches.
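
For reference, the conda-based install looked roughly like this at the time of writing (a sketch; the channel/package names and the example --local-id model ID are taken from the MLC docs of this era and may have changed, so check the current instructions):

# create an environment with the prebuilt chat CLI (package/channel names per MLC docs at the time)
conda create -n mlc-chat-venv -c mlc-ai -c conda-forge mlc-chat-cli-nightly
conda activate mlc-chat-venv
# chat with a prebuilt quantized model (model ID is an example)
mlc_chat_cli --local-id Llama-2-7b-chat-hf-q4f16_1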

Currently, the performance is about 50% slower than llama.cpp on my M2 MBA.