Trainers https://github.com/OpenAccess-AI-Collective/axolotl
RTX 4090
RTX 3090
7900 XTX
hipBLAS issue: https://github.com/ROCm/rocBLAS/issues/1339
Compile is 15% Faster
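A minimal sketch (assuming the standard PyTorch 2.x torch.compile API) of what the compile: True runs in the table below toggle; the model here is a toy stand-in, not the actual fine-tuned LLM:

```python
import torch
import torch.nn as nn

# Toy stand-in for the model being fine-tuned (illustration only).
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))

# torch.compile traces and JIT-compiles the forward/backward graphs.
# The first step pays compile overhead; later steps run faster, which is
# where the wall-clock gain shows up over a long training run.
model = torch.compile(model)

x = torch.randn(1, 4096)
out = model(x)  # compiled forward pass
```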
All runs use torch.optim.AdamW (bitsandbytes doesn't work for ROCm).

| VRAM       | max_seq_len | batch_size | gradient_accumulation_steps | Notes         | Time |
|------------|-------------|------------|-----------------------------|---------------|------|
| 21.247 GB  | 2048        | 1          | 8                           |               | 4h   |
| 21.247 GB  | 2048        | 1          | 16                          |               | 4h   |
| 21.247 GB  | 2048        | 1          | 64                          |               | 4h   |
| 21.320 GB  | 2048        | 1          | 64                          | compile: True | 3h   |
| 26.946 GB  | 4096        | 1          | 64                          |               | 3:40 |
| 44.677 GiB | 8192        | 1          | 64                          |               | x    |
| 38.337 GB  | 4096        | 2          | 64                          |               | x    |
| 26.734 GB  | 2048        | 2          | 64                          |               | 3:20 |
| ?          | 2048        | 2          | 64                          | bs=4?         |      |
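For reference, a minimal sketch of the optimizer/accumulation pattern the rows above describe: full-precision torch.optim.AdamW (since bitsandbytes' 8-bit optimizers don't work on ROCm) stepping once per gradient_accumulation_steps micro-batches. Model, data, and loss are stand-ins, not the actual axolotl fine-tuning code:

```python
import torch
import torch.nn as nn

# Stand-in model; the real runs fine-tune an LLM on max_seq_len-token batches.
model = nn.Linear(2048, 2048)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # full-precision optimizer states

batch_size = 1
gradient_accumulation_steps = 64

optimizer.zero_grad()
for micro_step in range(gradient_accumulation_steps):
    x = torch.randn(batch_size, 2048)
    loss = model(x).pow(2).mean()  # stand-in loss
    # Scale each micro-batch loss so the accumulated gradient is an average.
    (loss / gradient_accumulation_steps).backward()

optimizer.step()       # one parameter update per accumulation window
optimizer.zero_grad()
```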
compiled version
210