Post in r/localllama, finetuning

2024-06 Comparison

  • Repo
  • WandB
    • trainer-test

Set the same LR, warmup, etc. across runs; weight_decay and dropout should also match (a quick sketch of the shared knobs is below).
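
A minimal sketch of the knobs to pin down, assuming the Hugging Face transformers TrainingArguments API; batch size, gradient accumulation, and epoch count match the Unsloth banner below, while the remaining values are placeholders rather than the settings actually used in these runs:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,    # matches "Batch size per device = 2"
    gradient_accumulation_steps=64,   # matches "Gradient Accumulation steps = 64"
    num_train_epochs=1,
    learning_rate=2e-4,               # placeholder: just set it identically everywhere
    warmup_steps=10,                  # placeholder
    weight_decay=0.01,                # placeholder
    lr_scheduler_type="linear",       # placeholder
    logging_steps=1,
    bf16=True,
    seed=42,                          # fix the seed as well for comparability
)
# Note: LoRA dropout is not a TrainingArguments field; it lives in
# peft.LoraConfig(lora_dropout=...) or the framework's equivalent.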

Hardware | Driver version | nvcc --version | PyTorch
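
A quick way to fill that row in, sketched with standard tooling (assumes an NVIDIA box with nvidia-smi and nvcc on PATH; the ROCm cards would use rocm-smi / hipcc instead):

import subprocess
import torch

def run(cmd):
    # Helper: capture a command's stdout as text (convenience wrapper, not part of the repo).
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

print("GPU:     ", torch.cuda.get_device_name(0))
print("PyTorch: ", torch.__version__, "| CUDA toolkit seen by torch:", torch.version.cuda)
print("Driver:  ", run(["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]))
print("nvcc:    ", run(["nvcc", "--version"]).splitlines()[-1])   # last line carries the build string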

memory-high-water-4090

  • highwater
  • starting
  • max memory usage (see the sketch below)
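
A rough sketch of how those numbers could be captured with the standard torch.cuda memory stats (metric names here are illustrative, not necessarily what the WandB charts use):

import torch

torch.cuda.reset_peak_memory_stats()
starting = torch.cuda.memory_allocated() / 2**30           # baseline after model load, GiB

# ... run the training loop here ...

highwater = torch.cuda.max_memory_allocated() / 2**30      # allocator high-water mark, GiB
reserved  = torch.cuda.max_memory_reserved() / 2**30       # closer to what nvidia-smi shows
print(f"starting={starting:.3f} GiB  highwater={highwater:.3f} GiB  reserved={reserved:.3f} GiB")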

RTX 4090

25881|25880|Loss: 0.8484776020050049: 100%|█████████████████████████████████████████████| 25880/25880 [2:30:00<00:00, 2.88it/s]

RTX 3090

7900 XTX

W7900

19.540 GiB, ~280/303 W

1|404|Loss: 0.8711342215538025: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 404/404 [4:43:52<00:00, 42.16s/it]

hiplast

gdb

vs Unsloth

4090: 330/450 W, 17.178 GiB

❯ time CUDA_VISIBLE_DEVICES=0 python train.py

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 51,760 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 64
\        /    Total batch size = 128 | Total steps = 60
 "-____-"     Number of trainable parameters = 20,971,520

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [14:58<00:00, 14.98s/it]

real    15m20.123s
user    12m34.279s
sys     3m17.287s

vs autotrain

vs Axolotl

❯ time CUDA_VISIBLE_DEVICES=0 python train.py 
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth: Fast Llama patching release 2024.6
   \\   /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 23.677 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.41it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
meta-llama/Meta-Llama-3-8B-Instruct does not have a padding token! Will use pad_token = <|reserved_special_token_250|>.
Unsloth 2024.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 51,760 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 64
\        /    Total batch size = 128 | Total steps = 60
 "-____-"     Number of trainable parameters = 20,971,520