2023-11-13 Testing:
| System | Total VRAM |
|---|---|
| 8 x V100-16 | 128 GB |
| 8 x L4-24 | 192 GB |
Inference (70B llama-bench)
Q4_K_M
| Hardware | Model | Memory (MiB) | Prompt (tokens) | Gen (tokens) | Prompt t/s | Gen t/s |
|---|---|---|---|---|---|---|
| A100-80 SXM | Airoboros 3.1.2 70B q4_K_M | --- | 3968 | 128 | 628.92 | 18.40 |
| 4xV100-16 SXM | Airoboros 3.1.2 70B q4_K_M | 46286 | 3968 | 128 | 42.31 | 16.50 |
| 3xL4-24 SXM | Airoboros 3.1.2 70B q4_K_M | 45750 | 3968 | 128 | 74.74 | 10.62 |
| 8xV100-16 SXM | Airoboros 3.1.2 70B q4_K_M | 50298 | 3968 | 128 | 26.70 | 9.55 |
| 8xL4-24 SXM | Airoboros 3.1.2 70B q4_K_M | 49324 | 3968 | 128 | 41.29 | 8.74 |
Q8_0
| Hardware | Model | Memory (MiB) | Prompt (tokens) | Gen (tokens) | Prompt t/s | Gen t/s |
|---|---|---|---|---|---|---|
| A100-80 SXM | Airoboros 3.1.2 70B q8_0 | 73254 | 3968 | 128 | 549.57 | 17.09 |
| 6xV100-16 SXM | Airoboros 3.1.2 70B q8_0 | 78996 | 3968 | 128 | 33.92 | 12.73 |
| 8xV100-16 SXM | Airoboros 3.1.2 70B q8_0 | 81466 | 3968 | 128 | 27.92 | 10.18 |
| 4xL4-24 SXM | Airoboros 3.1.2 70B q8_0 | 76916 | 3968 | 128 | 67.54 | 8.57 |
| 8xL4-24 SXM | Airoboros 3.1.2 70B q8_0 | 80490 | 3968 | 128 | 41.00 | 8.51 |
The V100s are faster for batch=1 token generation, but the L4s are 2-3x faster for prefill (prompt processing).
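The tables above correspond to llama-bench runs with a 3968-token prompt and 128 generated tokens. The exact invocation isn't recorded in these notes; a sketch (model path is a placeholder):
# hypothetical reproduction of the rows above
./llama-bench -m models/airoboros-l2-70b-3.1.2.Q4_K_M.gguf -p 3968 -n 128 -ngl 99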
Qwen fine tune
8 x L4 = 143808 MiB used (78%/card), DeepSpeed ZeRO-2: 101.7h
8 x V100 = 124944 MiB used (95%/card), DeepSpeed ZeRO-2: keeps erroring out, gave up…
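The per-card memory/utilization figures above were presumably read off nvidia-smi; the exact monitoring command isn't recorded here, but something like this reports them:
# hypothetical: poll per-card memory and utilization every 5s during training
nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv -l 5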
axolotl fine tune
V100
- fp16 (V100 has no bf16 support)
- Have to disable flash attention (flash-attn does not support Volta)
- DeepSpeed ZeRO-3
accelerate launch -m axolotl.cli.train openhermes25-axolotl-5.yml --deepspeed axolotl/deepspeed/zero3.json
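The settings above map onto keys in openhermes25-axolotl-5.yml. The actual file contents aren't recorded here; a hypothetical check of the relevant keys (expected values are assumptions):
grep -E 'fp16|bf16|flash_attention|sequence_len' openhermes25-axolotl-5.yml
# expected along the lines of:
#   fp16: true              # V100 has no bf16
#   flash_attention: false  # not supported on Volta
#   sequence_len: 8192      # later dropped to 4096, see below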
Still OOM. Reduced sequence length from 8192 to 4096 to see if that helps; nope:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.00 GiB. GPU 3 has a total capacty of 15.77 GiB of which 363.88 MiB is free. Including non-PyTorch memory, this process has 15.41 GiB memory in use. Of the allocated memory 13.72 GiB is allocated by PyTorch, and 148.20 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I believe this needs about 160 GB of RAM?
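The OOM message itself suggests trying the allocator's max_split_size_mb option before giving up on this config; not tested here, and the value is a guess:
# untested: allocator tweak suggested by the OOM message (512 is a guess)
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
accelerate launch -m axolotl.cli.train openhermes25-axolotl-5.yml --deepspeed axolotl/deepspeed/zero3.json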
llm-jp-eval
For 7B models we can run evals on a single card. There is no scaling penalty on the L4s, but there is on the V100s.
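The exact llm-jp-eval invocation isn't recorded below; the card counts were presumably varied by restricting the visible GPUs, roughly like this (run_eval.sh is a hypothetical stand-in for the actual eval command):
# hypothetical: vary the number of visible cards for the timed runs below
time CUDA_VISIBLE_DEVICES=0 bash run_eval.sh      # 1 card
time CUDA_VISIBLE_DEVICES=0,1 bash run_eval.sh    # 2 cards
time bash run_eval.sh                             # all 8 cards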
L4
# 1xL4
real 30m8.486s
user 29m56.266s
sys 0m35.157s
# Trying 2 just b/c GPU was at 99% on 1 card
# 2xL4
real 29m32.763s
user 29m46.183s
sys 0m8.207s
# just in case, but no difference
# 8xL4
real 29m29.640s
user 29m41.345s
sys 0m13.537s
V100
# 1xV100
real 15m40.791s
user 15m28.559s
sys 0m20.066s
# 2xV100
real 20m14.239s
user 19m32.048s
sys 0m15.120s