2023-11-13 Testing:

System         Total Memory
8 x V100-16    128GB
8 x L4-24      192GB

Inference (70B llama-bench)


Hardware        Model                       Memory (MiB)  Prompt  Inference  prompt t/s  token/s
A100-80 SXM     Airoboros 3.1.2 70B q4_K_M  ---           3968    128        628.92      18.40
4xV100-16 SXM   Airoboros 3.1.2 70B q4_K_M  46286         3968    128        42.31       16.50
3xL4-24 SXM     Airoboros 3.1.2 70B q4_K_M  45750         3968    128        74.74       10.62
8xV100-16 SXM   Airoboros 3.1.2 70B q4_K_M  50298         3968    128        26.70       9.55
8xL4-24 SXM     Airoboros 3.1.2 70B q4_K_M  49324         3968    128        41.29       8.74


Hardware        Model                       Memory (MiB)  Prompt  Inference  prompt t/s  token/s
A100-80 SXM     Airoboros 3.1.2 70B q8_0    73254         3968    128        549.57      17.09
6xV100-16 SXM   Airoboros 3.1.2 70B q8_0    78996         3968    128        33.92       12.73
8xV100-16 SXM   Airoboros 3.1.2 70B q8_0    81466         3968    128        27.92       10.18
4xL4-24 SXM     Airoboros 3.1.2 70B q8_0    76916         3968    128        67.54       8.57
8xL4-24 SXM     Airoboros 3.1.2 70B q8_0    80490         3968    128        41.00       8.51

V100 is faster at batch=1 token generation, but L4 is faster for prefill (about 1.5-2x going by the prompt t/s numbers above)
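As a quick check on the V100 vs L4 comparison, the ratios can be computed directly from the table rows. This is a minimal sketch in Python: the dictionary layout, the choice of pairs, and the `ratio` helper are illustrative only, and the numbers are copied from the q4_K_M and q8_0 tables.

```python
# (prompt t/s, token/s) copied from the llama-bench tables above.
results = {
    "q4_K_M": {
        "4xV100-16": (42.31, 16.50),
        "3xL4-24": (74.74, 10.62),
        "8xV100-16": (26.70, 9.55),
        "8xL4-24": (41.29, 8.74),
    },
    "q8_0": {
        "6xV100-16": (33.92, 12.73),
        "4xL4-24": (67.54, 8.57),
        "8xV100-16": (27.92, 10.18),
        "8xL4-24": (41.00, 8.51),
    },
}

# Pairs to compare (an arbitrary choice for this sketch):
# the smallest listed config of each card type, and the full 8-card box.
pairs = {
    "q4_K_M": [("3xL4-24", "4xV100-16"), ("8xL4-24", "8xV100-16")],
    "q8_0": [("4xL4-24", "6xV100-16"), ("8xL4-24", "8xV100-16")],
}

def ratio(quant, l4, v100):
    """Return (L4/V100 prefill ratio, V100/L4 generation ratio)."""
    l4_pp, l4_tg = results[quant][l4]
    v_pp, v_tg = results[quant][v100]
    return l4_pp / v_pp, v_tg / l4_tg

for quant, combos in pairs.items():
    for l4, v100 in combos:
        prefill, gen = ratio(quant, l4, v100)
        print(f"{quant:7s} {l4:8s} vs {v100:10s} "
              f"prefill L4/V100 = {prefill:.2f}x, generation V100/L4 = {gen:.2f}x")
```

With these pairs the L4 prefill advantage comes out between roughly 1.5x and 2.0x, and the V100 generation advantage between roughly 1.1x and 1.55x.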

Qwen fine tune

8xL4: 143808 MiB usage, 78%/card, ds2: 101.7h

8xV100: 124944 MiB usage, 95%/card, ds2: keeps erroring out, gave up…

axolotl fine tune


  • fp16
  • Have to disable flash attention
  • deepspeed zero3
accelerate launch -m axolotl.cli.train openhermes25-axolotl-5.yml --deepspeed axolotl/deepspeed/zero3.json

Still OOM. Dropped the sequence length from 8192 to 4096 to see if that helps: nope.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU 3 has a total capacty of 15.77 GiB of which 363.88 MiB is free. Including non-PyTorch memory, this process has 15.41 GiB memory in use. Of the allocated memory 13.72 GiB is allocated by PyTorch, and 148.20 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
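The numbers in the message suggest fragmentation is not the issue here, so max_split_size_mb is unlikely to help. A minimal Python sketch of that arithmetic follows; the `to_mib` helper and the abbreviated `msg` string are illustrative only, and the figures are taken from the error above.

```python
import re

# The OOM message from above, abbreviated to the parts that carry numbers.
msg = (
    "CUDA out of memory. Tried to allocate 4.00 GiB. "
    "GPU 3 has a total capacty of 15.77 GiB of which 363.88 MiB is free. "
    "Of the allocated memory 13.72 GiB is allocated by PyTorch, "
    "and 148.20 MiB is reserved by PyTorch but unallocated."
)

UNIT_MIB = {"MiB": 1.0, "GiB": 1024.0}

def to_mib(pattern, text):
    """Find '<number> <unit>' via the given regex and return the value in MiB."""
    m = re.search(pattern, text)
    if m is None:
        raise ValueError(f"pattern not found: {pattern}")
    return float(m.group(1)) * UNIT_MIB[m.group(2)]

requested = to_mib(r"Tried to allocate ([\d.]+) (MiB|GiB)", msg)
free = to_mib(r"of which ([\d.]+) (MiB|GiB) is free", msg)
reserved_unused = to_mib(r"and ([\d.]+) (MiB|GiB) is reserved", msg)

# Best case if fragmentation were fully eliminated:
# everything free plus everything reserved-but-unallocated becomes usable.
reclaimable = free + reserved_unused
print(f"requested:   {requested:8.2f} MiB")
print(f"reclaimable: {reclaimable:8.2f} MiB")
print(f"shortfall:   {requested - reclaimable:8.2f} MiB")
```

Even in that best case only about 512 MiB is available on GPU 3 against a 4 GiB request, so the card is simply short on memory for this allocation.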

I believe this needs about 160GB of RAM?


For 7B we can run evals on a single card. There is no scaling penalty on the L4s, but there is on the V100s.


# 1xL4
real    30m8.486s
user    29m56.266s
sys     0m35.157s

# Trying 2 just b/c GPU was at 99% on 1 card
# 2xL4
real    29m32.763s
user    29m46.183s
sys     0m8.207s

# just in case, but no difference
# 8 x L4
real    29m29.640s
user    29m41.345s
sys     0m13.537s


# 1xV100
real    15m40.791s
user    15m28.559s
sys     0m20.066s

# 2xV100
real    20m14.239s
user    19m32.048s
sys     0m15.120s
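
To put numbers on the scaling behaviour, the "real" times above can be converted to seconds and compared against the single-card run. A minimal Python sketch; the `seconds` helper and the dictionary layout are illustrative only, and the times are copied from the runs above.

```python
import re

# "real" wall-clock times copied from the `time` output above.
real = {
    "L4": {1: "30m8.486s", 2: "29m32.763s", 8: "29m29.640s"},
    "V100": {1: "15m40.791s", 2: "20m14.239s"},
}

def seconds(t):
    """Convert a `time`-style 'XmY.ZZZs' string to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", t)
    if m is None:
        raise ValueError(f"unexpected time format: {t!r}")
    return int(m.group(1)) * 60 + float(m.group(2))

for gpu, runs in real.items():
    base = seconds(runs[1])  # single-card run is the baseline
    for n, t in sorted(runs.items()):
        s = seconds(t)
        print(f"{n}x{gpu:5s} {s:8.1f}s  {s / base:.2f}x of 1-card time")
```

The L4 runs stay within about 2% of each other, while 2xV100 takes about 1.29x the wall time of 1xV100. A single V100 still finishes in roughly half the time of a single L4.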