2023-11-13 Testing:

System	Total Memory
8 x V100-16	128GB
8 x L4-24	192GB

Inference (70B llama-bench)

Q4_K_M

Hardware	Model	Memory	Prompt	Inference	prompt t/s	token/s
A100-80 SXM	Airoboros 3.1.2 70B q4_K_M	---	3968	128	628.92	18.40
4xV100-16 SXM	Airoboros 3.1.2 70B q4_K_M	46286	3968	128	42.31	16.50
3xL4-24 SXM	Airoboros 3.1.2 70B q4_K_M	45750	3968	128	74.74	10.62
8xV100-16 SXM	Airoboros 3.1.2 70B q4_K_M	50298	3968	128	26.70	9.55
8xL4-24 SXM	Airoboros 3.1.2 70B q4_K_M	49324	3968	128	41.29	8.74

Q8_0

Hardware	Model	Memory	Prompt	Inference	prompt t/s	token/s
A100-80 SXM	Airoboros 3.1.2 70B q8_0	73254	3968	128	549.57	17.09
6xV100-16 SXM	Airoboros 3.1.2 70B q8_0	78996	3968	128	33.92	12.73
8xV100-16 SXM	Airoboros 3.1.2 70B q8_0	81466	3968	128	27.92	10.18
4xL4-24 SXM	Airoboros 3.1.2 70B q8_0	76916	3968	128	67.54	8.57
8xL4-24 SXM	Airoboros 3.1.2 70B q8_0	80490	3968	128	41.00	8.51

V100 is faster batch=1, but L4 is 2-3X faster for prefill

Qwen fine tune

8xL4 = 143808 MiB usage 78%/card ds2: 101.7h

8xV100=124944 MiB usage 95%/card ds2: keeps erroring out, gave up…

axolotl fine tune

V100

fp16
Have to disable flash attention
deepspeed zero3

accelerate launch -m axolotl.cli.train openhermes25-axolotl-5.yml --deepspeed axolotl/deepspeed/zero3.json

Still OOM, set sequence 8192→4096 and see if that helps, nope

torch.cuda.OutOfMemoryError:  CUDA out of memory. Tried to allocate 4.00 GiB. GPU 3 has a total capacty of 15.77 GiB of which 363.88 MiB
 is free. Including non-PyTorch memory, this process has 15.41 GiB memory in use. Of the allocated memory 13.72 GiB is allocated by PyTo
rch, and 148.20 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to
 avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I believe needs about 160GB of RAM?

llm-jp-eval

For 7B we can run evals on a single card. There is no scaling penalty on the L4s, but there is on the V100

L4

# 1xL4
real    30m8.486s
user    29m56.266s
sys     0m35.157s

# Trying 2 just b/c GPU was at 99% on 1 card
# 2xL4
real    29m32.763s
user    29m46.183s
sys     0m8.207s

# just in case, but no difference
# 8 x L4
real    29m29.640s
user    29m41.345s
sys     0m13.537s

V100

1 X V100
real    15m40.791s
user    15m28.559s
sys     0m20.066s

2 x V100
real    20m14.239s
user    19m32.048s
sys     0m15.120s

📖 llm-tracker

Explorer

8xV100-16 vs 8xL4-24

Inference (70B llama-bench)

Q4_K_M

Q8_0

Qwen fine tune

axolotl fine tune

V100

llm-jp-eval

L4

V100

Table of Contents

Backlinks