Nvidia GPUs are the most compatible hardware for AI/ML. All of Nvidia’s GPUs (consumer and professional) support CUDA, and basically all popular ML libraries and frameworks support CUDA.

The biggest limit on which LLMs you can run is how much GPU VRAM you have. The r/LocalLLaMA wiki gives a good overview of how much VRAM you need for various quantized models.
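
As a rough sanity check before downloading a model, the weights alone take about (parameter count × bits per weight ÷ 8) bytes. The sketch below is only that arithmetic. The helper name `estimate_weight_mb` is made up for illustration, the bit width is nominal (real quantization formats store a bit more per weight), and the result leaves out the KV cache, activations and framework overhead, so actual usage will be higher.

```shell
# Rough size of the model weights in MB (10^6 bytes).
# $1 = parameter count in millions, $2 = nominal bits per weight after quantization.
# Hypothetical helper: ignores KV cache, activations and runtime overhead.
estimate_weight_mb() {
    echo $(( $1 * $2 / 8 ))
}

estimate_weight_mb 7000 4     # 7B model at 4 bits per weight
estimate_weight_mb 33000 4    # 33B model at 4 bits per weight

# Total VRAM per GPU, if the NVIDIA driver is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader || true
fi
```

Compare the estimate against the `memory.total` figure and leave a few GB of headroom for context.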

Nvidia cards can run CUDA under WSL, which means that most software will work on both Linux and Windows. If you are serious about ML, Linux still has advantages, such as better performance, lower VRAM usage (you can run headless), and likely fewer edge cases.

For inferencing you have a few options:

  • llama.cpp - As of July 2023, llama.cpp’s CUDA performance is on par with ExLlama’s, which is generally the fastest performance you can get with quantized models. Its GGMLv3 format is a convenient single binary file with a variety of well-defined quantization levels (k-quants) that have slightly better perplexity than the most widely supported alternative, GPTQ. It is, however, slightly less memory efficient, e.g., it can run out of memory on 33B models on 24GiB GPUs where ExLlama does not.
    • llama.cpp is best for low-VRAM GPUs since you can offload only some of the layers to the GPU (use -ngl <x> to set the number of layers and --low-vram to move the cache to system memory as well). The more layers you can load into VRAM, the faster your model will run.
    • llama.cpp is a huge project with many active contributors, and now has some VC backing as well
  • ExLlama - if you have an RTX 3000/4000 GPU, this is probably going to be your best option. It is on par in performance with llama.cpp and is also the most memory-efficient implementation available. If you are splitting a model between multiple GPUs, ExLlama also seems to be the most efficient at splitting inferencing between cards.
    • ExLlama is a smaller project but contributions are being actively merged (I submitted a PR) and the maintainer is super responsive.
  • AutoGPTQ - this engine, while generally slower, may be better for older GPU architectures. There are CUDA and Triton modes, but the biggest selling point is that it can not only run inference, but also quantize and fine-tune many model types.
  • MLC LLM - this was a bit of a challenge to set up, but turns out to perform quite well (perhaps better than all other engines?)
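
To make the layer-offload trade-off in llama.cpp concrete, here is a back-of-the-envelope helper for picking a starting value for -ngl. It assumes every layer costs about the same amount of VRAM, which is only approximately true. The name `layers_that_fit` and the numbers in the example are hypothetical, and the model path is a placeholder. Treat the result as a first guess and adjust it until you stop running out of memory.

```shell
# Starting guess for llama.cpp's -ngl value.
# $1 = VRAM you are willing to use (MB), $2 = approximate size of one layer (MB),
# $3 = number of layers in the model.
# Hypothetical helper: assumes all layers are the same size.
layers_that_fit() {
    n=$(( $1 / $2 ))
    if [ "$n" -gt "$3" ]; then
        n=$3
    fi
    echo "$n"
}

NGL=$(layers_that_fit 8000 200 43)
echo "offloading $NGL layers"

# Then pass it to llama.cpp (placeholder model path):
# ./main -m /path/to/model.ggmlv3.q4_K_M.bin -ngl "$NGL" -p "Hello"
```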

CUDA Version Hell

The bane of your existence is probably going to be managing all the different CUDA versions that are required for various libraries. Recommendations:

  • Use conda (well, mamba, unless you want to grow old and die waiting for dependencies to resolve). If you don’t know where to start, just install Mambaforge directly and create a new environment for every single library.
  • Install the exact version of CUDA that you need for each environment and point to it, eg:
    conda create -n autogptq
    conda activate autogptq
    mamba install -c "nvidia/label/cuda-11.7.0" cuda-toolkit
    conda env config vars set CUDA_PATH="$CONDA_PREFIX"
    conda env config vars set CUDA_HOME="$CONDA_PREFIX"
    Additionally, if you need to use a specific g++ version (eg, CUDA 11.7 requires g++ < 12):
    # see valid versions: https://anaconda.org/conda-forge/gxx/files
    mamba install gxx=11.4.0
    And if you need to install PyTorch manually…
    mamba install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
  • This should be good enough, but if all else fails, you can use a custom Docker container as well.
  • There’s envd, a Docker addon that promises easier dev environments for AI/ML, although it also has a number of open bugs
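
Once an environment is set up, it is worth confirming that the toolkit you are compiling with is the one you meant to install. A minimal sketch: `parse_nvcc_release` is a made-up helper that pulls the release number out of `nvcc --version` output, and the PyTorch line assumes PyTorch is installed in the active environment.

```shell
# Extract the "release X.Y" number from `nvcc --version` output on stdin.
# Hypothetical helper, not part of any toolkit.
parse_nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Toolkit version seen by the active environment, if nvcc is on PATH.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version | parse_nvcc_release
fi

# CUDA version PyTorch was built against (uncomment if PyTorch is installed):
# python -c "import torch; print(torch.version.cuda)"
```

If the two numbers disagree with what the library expects, check that CUDA_HOME and CUDA_PATH point at the environment you activated.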

Inferencing Packages

Package      | Commit  | Model          | Quant   | Memory Usage | 4090 @ 400PL | 3090 @ 360PL
MLC LLM CUDA | 3c53eeb | llama2-7b-chat | q4f16_1 | 5932         | 115.87       | 83.63
MLC LLM Perf | c40be6a | llama2-7b-chat | q4f16_1 | 5244         | 165.57       | 131.73


mlc-llm is an interesting project that lets you compile models (from HF format) to run on multiple platforms (Android, iOS, Mac/Win/Linux, and even WebGPU). On PC, however, the install instructions only give you a pre-compiled Vulkan version, which is much slower than ExLlama or llama.cpp. When a CUDA version is compiled, it looks like it may actually be the fastest inferencing engine currently available (as of 2023-08-03).

Here’s how to set it up on Arch Linux:

# Required
paru -S rustup
rustup default stable
paru -S llvm

# Environment
conda create -n mlc
conda activate mlc
mamba install pip

# Compile TVM
git clone https://github.com/mlc-ai/relax.git --recursive
cd relax
mkdir build
cp cmake/config.cmake build
sed -i 's/set(USE_CUDA OFF)/set(USE_CUDA ON)/g' build/config.cmake
sed -i 's/set(USE_CUDNN OFF)/set(USE_CUDNN ON)/g' build/config.cmake
sed -i 's/set(USE_CUBLAS OFF)/set(USE_CUBLAS ON)/g' build/config.cmake

make -j`nproc`
export TVM_HOME=`pwd`
cd ..

# Make model
# IMPORTANT: CUDA is targeted per GPU. Be sure to use CUDA_VISIBLE_DEVICES if you have multiple generations of CUDA...
# NOTE: the maximum context length is determined for a model at compile time here. It defaults to 2048, so you will want to set a longer one if your model supports it (I don't believe MLC currently supports RoPE extension)
git clone https://github.com/mlc-ai/mlc-llm.git --recursive
cd mlc-llm
python3 -m mlc_llm.build --target cuda --quantization q4f16_1 --model /models/llm/llama2/meta-llama_Llama-2-7b-chat-hf --max-seq-len 4096

# Compile mlc-llm
mkdir build && cd build
cmake .. -DUSE_CUDA=ON
make -j`nproc`
cd ..

# In `mlc-llm` folder you should now be able to run
build/mlc_chat_cli --local-id meta-llama_Llama-2-7b-chat-hf-q4f16_1 --device-name cuda --device_id 0 --evaluate --eval-prompt-len 3968 --eval-gen-len=128
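
Because the compiled model targets whichever GPU is visible at build time, it helps to look up device indexes before setting CUDA_VISIBLE_DEVICES. The sketch below parses nvidia-smi's CSV query output. `gpu_index_for` is a hypothetical helper, and the sample lines in the usage example are illustrative rather than captured from a real machine.

```shell
# Print the index of the first GPU whose name matches the pattern in $1.
# Reads lines like "0, NVIDIA GeForce RTX 4090" on stdin, as produced by
#   nvidia-smi --query-gpu=index,name --format=csv,noheader
gpu_index_for() {
    awk -F', ' -v pat="$1" '$2 ~ pat { print $1; exit }'
}

# Illustrative input, not real output from a machine:
printf '0, NVIDIA GeForce RTX 4090\n1, NVIDIA GeForce RTX 3090\n' | gpu_index_for 3090

# On a real system (PCI_BUS_ID makes CUDA number devices the same way nvidia-smi does):
# export CUDA_DEVICE_ORDER=PCI_BUS_ID
# export CUDA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | gpu_index_for 4090)
```

Without CUDA_DEVICE_ORDER=PCI_BUS_ID, CUDA may order devices fastest-first, so the index reported by nvidia-smi is not guaranteed to match.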

Note that the main branch, as of 2023-08-03, runs at about the same speed as ExLlama and behind llama.cpp. However, there is a separate “benchmark” version with performance optimizations that have not yet made their way back to the main branch. It can be found in this repo: https://github.com/junrushao/llm-perf-bench

And here’s how to get it working:

paru -S cutlass
# Otherwise GLIBCXX_3.4.32 not happy
mamba install cmake

git clone --recursive https://github.com/junrushao/mlc-llm/ --branch benchmark mlc-llm.junrushao-benchmark
cd mlc-llm.junrushao-benchmark

# Adapted from https://github.com/junrushao/llm-perf-bench/blob/main/install/tvm.sh
export MLC_HOME=`pwd`
export TVM_HOME=$MLC_HOME/3rdparty/tvm
export PYTHONPATH=$TVM_HOME/python
cd $TVM_HOME && mkdir build && cd build && cp ../cmake/config.cmake .
echo "set(CMAKE_BUILD_TYPE RelWithDebInfo)" >>config.cmake
echo "set(CMAKE_EXPORT_COMPILE_COMMANDS ON)" >>config.cmake
echo "set(USE_GTEST OFF)" >>config.cmake
echo "set(USE_CUDA ON)" >>config.cmake
echo "set(USE_LLVM ON)" >>config.cmake
echo "set(USE_VULKAN OFF)" >>config.cmake
echo "set(USE_CUTLASS ON)" >>config.cmake
cmake .. && make -j$(nproc)

# Adapted from https://github.com/junrushao/llm-perf-bench/blob/main/install/mlc.sh
cd $MLC_HOME && mkdir build && cd build && touch config.cmake
echo "set(CMAKE_BUILD_TYPE RelWithDebInfo)" >>config.cmake
echo "set(CMAKE_EXPORT_COMPILE_COMMANDS ON)" >>config.cmake
echo "set(USE_CUDA ON)" >>config.cmake
echo "set(USE_VULKAN OFF)" >>config.cmake
echo "set(USE_METAL OFF)" >>config.cmake
echo "set(USE_OPENCL OFF)" >>config.cmake
cmake .. && make -j$(nproc)

CUDA_VISIBLE_DEVICES=0 python build.py --target cuda --quantization q4f16_1 --model /models/llm/llama2/meta-llama_Llama-2-7b-chat-hf --use-cache=0

### or in the Docker container
micromamba activate python311
CUDA_VISIBLE_DEVICES=0 python build.py   --model /models/llm/llama2/meta-llama_Llama-2-7b-chat-hf   --target cuda   --quantization q4f16_1   --artifact-path "./dist"   --use-cache 0
mv dist/meta-llama_Llama-2-7b-chat-hf-q4f16_1 dist/4090-meta-llama_Llama-2-7b-chat-hf-q4f16_1

CUDA_VISIBLE_DEVICES=1 python build.py   --model /models/llm/llama2/meta-llama_Llama-2-7b-chat-hf   --target cuda   --quantization q4f16_1   --artifact-path "./dist"   --use-cache 0
mv dist/meta-llama_Llama-2-7b-chat-hf-q4f16_1 dist/3090-meta-llama_Llama-2-7b-chat-hf-q4f16_1

# copy dist folder to where you want
scp -r -P 45678 root@ ./
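
The two per-GPU builds above follow the same pattern, so they can be wrapped in a small function. This is only a sketch of the commands already shown: the model path, quantization and labels come from above, while `build_for_gpu` and the `DRY_RUN` switch are additions made here for illustration so the commands can be previewed without a GPU.

```shell
# Build one artifact per GPU and tag the output directory with a label.
# MODEL and QUANT are taken from the commands above; build_for_gpu and DRY_RUN
# are hypothetical additions for this sketch.
MODEL=/models/llm/llama2/meta-llama_Llama-2-7b-chat-hf
QUANT=q4f16_1

build_for_gpu() {
    gpu_id=$1
    label=$2
    name="$(basename "$MODEL")-$QUANT"
    run=${DRY_RUN:+echo}    # when DRY_RUN is set, print the commands instead of running them
    $run env CUDA_VISIBLE_DEVICES="$gpu_id" python build.py --model "$MODEL" \
        --target cuda --quantization "$QUANT" --artifact-path ./dist --use-cache 0
    $run mv "dist/$name" "dist/$label-$name"
}

# Preview the commands without running them:
DRY_RUN=1 build_for_gpu 0 4090
DRY_RUN=1 build_for_gpu 1 3090
```

Drop the `DRY_RUN=1` prefix to run the builds for real from inside the mlc-llm checkout.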

On my 4090, the q4f16_1 build runs at 165.98 t/s, vs 106.70 t/s for a q4 32g act-order GPTQ w/ ExLlama and 138.83 t/s for a q4_K_M GGMLv3 with llama.cpp.
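
To put those numbers in relative terms, the ratios can be computed directly from the figures quoted above. `speedup` is just a throwaway helper name for this sketch.

```shell
# Ratio of two tokens-per-second figures, rounded to two decimals.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup 165.98 106.70   # MLC LLM q4f16_1 vs ExLlama (GPTQ): prints 1.56
speedup 165.98 138.83   # MLC LLM q4f16_1 vs llama.cpp (q4_K_M): prints 1.20
```

The quantization formats differ between engines, so this compares setups rather than identical models.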

Tips and Tricks

Monitor your Nvidia GPUs with:

watch nvidia-smi

You can lower power limits if you’re inferencing:

sudo nvidia-smi -i 0 -pl 360
  • You can get your GPU IDs with nvidia-smi -L
  • For inferencing, I can lower my 4090 from 450W to 360W and only lose about 1-2% performance but everyone should test for themselves what works best for their setup.
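
A small sketch for working out and checking a power limit. The 450W and 80% figures mirror the 4090 example above (450W down to 360W); `pl_watts` is a hypothetical helper, and the query fields are standard nvidia-smi ones.

```shell
# Watts corresponding to a percentage of the default power limit.
# $1 = default limit in watts, $2 = percentage to keep.
pl_watts() {
    echo $(( $1 * $2 / 100 ))
}

pl_watts 450 80   # 80% of a 450 W default

# Show current and default limits, if the NVIDIA driver is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,power.limit,power.default_limit --format=csv || true
fi

# Apply it (needs root):
# sudo nvidia-smi -i 0 -pl "$(pl_watts 450 80)"
```

The limit set this way does not survive a reboot, so reapply it at startup if you want it to stick.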

Mobile Power Limits

Power limits on mobile GPUs are mostly locked by the VBIOS, but you might be able to run nvidia-powerd and get +10W.

Brand New Ubuntu 22.04 LTS Setup


# Check that an NVIDIA GPU is detected
lspci | grep -i nvidia

# Remove previous NVIDIA driver installation
# (patterns are quoted so the shell does not expand them against local files; rm has no -y flag)
sudo apt-get purge 'nvidia*' -y
sudo apt remove 'nvidia-*' -y
sudo rm -f /etc/apt/sources.list.d/cuda*
sudo apt-get autoremove -y && sudo apt-get autoclean -y
sudo rm -rf /usr/local/cuda*

# System update
sudo apt-get update -y
sudo apt-get upgrade -y

# Install essential packages
sudo apt-get install g++ freeglut3-dev build-essential libx11-dev -y
sudo apt-get install libxmu-dev libxi-dev libglu1-mesa libglu1-mesa-dev -y

# Add PPA repository for NVIDIA drivers
sudo add-apt-repository ppa:graphics-drivers/ppa -y
sudo apt update -y

# Download and set up CUDA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.3.1/local_installers/cuda-repo-ubuntu2204-12-3-local_12.3.1-545.23.08-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-3-local_12.3.1-545.23.08-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update -y

# Install NVIDIA driver and dependencies
# (cuda-drivers is provided by the CUDA repository added above, so it is installed after that step)
sudo apt-get install -y cuda-drivers

# Install CUDA Toolkit 12.3
sudo apt-get -y install cuda-toolkit-12-3

# Set up paths for CUDA
echo 'export PATH=/usr/local/cuda-12.3/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.3/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
sudo ldconfig

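# Install cuDNN v8.2.1 (the archive named below was built for CUDA 11.3; use the cuDNN build that matches your CUDA version)
wget https://developer.nvidia.com/compute/machine-learning/cudnn/secure/
tar -xzvf "cudnn-11.3-linux-x64-v8.2.1.32.tgz"

# Copy cuDNN files to CUDA toolkit directory (cuDNN 8 ships several cudnn*.h headers, not just cudnn.h)
sudo cp -P cuda/include/cudnn*.h /usr/local/cuda-12.3/include
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda-12.3/lib64/
sudo chmod a+r /usr/local/cuda-12.3/include/cudnn*.h /usr/local/cuda-12.3/lib64/libcudnn*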

# Install nvtop for monitoring
sudo apt install nvtop -y

# Verify the installation
nvcc -V
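
After a reboot, a quick check that everything landed on PATH can save some head-scratching. A minimal sketch: `missing_tools` is a made-up helper that only tests whether the commands exist, not whether the driver and toolkit versions are compatible.

```shell
# Report which of the given commands are missing from PATH.
# Returns non-zero if any are missing. Hypothetical helper.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

if missing_tools nvidia-smi nvcc nvtop; then
    nvidia-smi    # driver version and visible GPUs
    nvcc -V       # toolkit release; should report 12.3 after the steps above
fi
```

If nvcc is reported missing, open a new shell or re-source ~/.bashrc so the PATH changes take effect.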