Fine Tuning Mistral

We'll try to fine tune Mistral 7B.

Training Details

The Mistral AI Discord has a #finetuning channel which has some info/discussion:

dhokas: here are the main parameters we used for the instruct model : optimizer: adamw, max_lr: 2.5e-5, warmup steps: 50, total steps: 1250, seqlen: 32K, dropout: 0.2, world_size: 8

dhokas: Peak lr.

dhokas: Dropout added after the ffn layer

dhokas: Dropout does not make a huge difference iirc
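
To make the quoted warmup and peak values concrete, here's a minimal sketch of a schedule with linear warmup to the peak learning rate over 50 steps. The Discord messages don't say how the rate decays after warmup, so the cosine decay to zero is purely my assumption, and `lr_at_step` is a hypothetical helper name.

```python
import math

PEAK_LR = 2.5e-5      # max_lr from the Discord message
WARMUP_STEPS = 50     # warmup steps from the Discord message
TOTAL_STEPS = 1250    # total steps from the Discord message


def lr_at_step(step: int) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # ASSUMPTION: cosine decay to zero after warmup (not stated in the source).
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))


if __name__ == "__main__":
    # Print the rate at a few points along the schedule.
    for s in (0, 25, 50, 650, 1250):
        print(s, f"{lr_at_step(s):.3e}")
```

Swapping the decay branch for a linear or constant one only changes the last two lines of the function; the warmup part is the bit the quoted parameters actually pin down.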

autotrain-advanced

If you just want to try a finetune, this is pretty dead simple:

pip install git+https://github.com/huggingface/transformers
pip install autotrain-advanced

time autotrain llm \
        --train \
        --model "/models/llm/hf/mistralai_Mistral-7B-Instruct-v0.1" \
        --data-path timdettmers/openassistant-guanaco \
        --use-peft \
        --use-int4 \
        --lr 2e-4 \
        --batch-size 4 \
        --epochs 1 \
        --trainer sft \
        --project-name m7b-guanaco \
        --target-modules q_proj,v_proj

Note: looking at the code, autotrain only takes a single --text_column, so it appears to be limited in the types of datasets it can process: chat or other multipart instruction datasets would need to be flattened into one text field first.
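
One way around the single text column is to flatten each conversation into one string before training. A rough sketch, assuming a made-up record layout (a list of turns with `role` and `content` keys) and a "### Human: / ### Assistant:" template similar to what the guanaco dataset appears to use; the helper names and the `text` column name are my own placeholders, not autotrain's API.

```python
from typing import Dict, List

# ASSUMPTION: hypothetical record layout; real chat datasets differ.
Turn = Dict[str, str]  # {"role": "human" | "assistant", "content": "..."}

ROLE_PREFIX = {"human": "### Human:", "assistant": "### Assistant:"}


def flatten_chat(turns: List[Turn]) -> str:
    """Join a multi-turn conversation into one training string."""
    parts = []
    for turn in turns:
        prefix = ROLE_PREFIX[turn["role"]]  # raises KeyError on unknown roles
        parts.append(f"{prefix} {turn['content'].strip()}")
    return "\n".join(parts)


def to_text_rows(conversations: List[List[Turn]]) -> List[Dict[str, str]]:
    """Produce rows that each carry a single "text" column."""
    return [{"text": flatten_chat(conv)} for conv in conversations]
```

The resulting rows can be written out as JSONL or CSV and passed as the training data, with the text column pointing at `text`.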

Airoboros

We use Jon Durbin's QLora fork for training Airoboros (has the Airoboros dataset/instruction formatting):

We also start with the latest gist available for the training script and deepspeed.json:

Starting from a clean env:

# The basics
mamba create -n airoboros
# Activate the new env so the installs below land in it
mamba activate airoboros
mamba install pip
mamba install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia

# This is required but not in the requirements.txt
pip install packaging
# If you don't have this and need flash_attn you will be sad
# https://github.com/facebookresearch/xformers/issues/550#issuecomment-1331715980
pip install ninja

We will want to modify requirements.txt. I remove flash-attn because 1) it forces a compile w/ PyTorch Nightly, and 2) from my understanding, Mistral requires the latest flash-attn version to work anyway, so it gets installed separately afterward.

# remove flash-attn then
pip install -r requirements.txt
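
If you'd rather script the edit than open the file by hand, something like this works. It's a sketch that assumes the requirement is listed on its own line as `flash-attn` or `flash_attn` (optionally with a version pin); `drop_flash_attn` and `rewrite_requirements` are hypothetical helper names.

```python
import re
from pathlib import Path
from typing import Iterable, List

# Matches "flash-attn" or "flash_attn" at the start of a requirement line,
# followed by end of line or a version/extras/marker/comment character.
_FLASH_ATTN = re.compile(r"^\s*flash[-_]attn\s*($|[=<>!~\[;#@])", re.IGNORECASE)


def drop_flash_attn(lines: Iterable[str]) -> List[str]:
    """Return the requirement lines with any flash-attn entry removed."""
    return [line for line in lines if not _FLASH_ATTN.match(line)]


def rewrite_requirements(path: str = "requirements.txt") -> None:
    """Rewrite the file in place without the flash-attn line."""
    p = Path(path)
    kept = drop_flash_attn(p.read_text().splitlines())
    p.write_text("\n".join(kept) + "\n")
```

Call `rewrite_requirements()` from the repo root before running the pip install; commented-out lines and unrelated packages are left alone.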

Once that finishes, we install our Mistral-compatible libs:

pip install flash-attn
pip install git+https://github.com/huggingface/transformers

At this point we need to poke around with the train.sh script. Here I largely stuck w/ Jon's Airoboros training scheme rather than Mistral's instruct settings:

export BASE_DIR="."
export WANDB_PROJECT=mistral-instruct-7b-airoboros-2.2.1

torchrun --nnodes=1 --nproc_per_node=2 qlora.py \
  --model_name_or_path "/models/llm/hf/mistralai_Mistral-7B-Instruct-v0.1" \
  --output_dir $BASE_DIR/$WANDB_PROJECT \
  --num_train_epochs 5 \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 25 \
  --save_total_limit 1 \
  --data_seed 11422 \
  --evaluation_strategy no \
  --eval_dataset_size 0.001 \
  --max_new_tokens 4096 \
  --dataloader_num_workers 3 \
  --logging_strategy steps \
  --remove_unused_columns False \
  --do_train \
  --double_quant \
  --quant_type nf4 \
  --bits 4 \
  --bf16 \
  --dataset $BASE_DIR/instructions.jsonl \
  --dataset_format airoboros \
  --model_max_len 4096 \
  --per_device_train_batch_size 1 \
  --learning_rate 0.000022 \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.005 \
  --weight_decay 0.0 \
  --seed 11422 \
  --report_to wandb \
  --deepspeed deepspeed-7b.json \
  --gradient_checkpointing True \
  --ddp_find_unused_parameters False \
  --max_memory_MB 24000

You can also add the eval flags below. I skipped them for this run and may add them next time (or just run the evals separately):

  --do_eval
  --do_mmlu_eval
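
To get a feel for how the flags in the torchrun command translate into optimizer steps, here's a rough sketch. `--per_device_train_batch_size 1`, `--nproc_per_node=2`, `--num_train_epochs 5` and `--warmup_ratio 0.005` come from the command; the dataset size and gradient accumulation value are placeholders I made up (the command doesn't set gradient accumulation, so the real value depends on qlora.py's default and the deepspeed config). The step count is an approximation, not exactly how the Trainer rounds, and `schedule_summary` is a hypothetical helper.

```python
import math


def schedule_summary(num_examples, epochs, per_device_batch, num_procs,
                     grad_accum, warmup_ratio):
    """Approximate optimizer-step counts for a data-parallel run."""
    # Sequences consumed per optimizer step across all processes.
    effective_batch = per_device_batch * num_procs * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    # Warmup length as a fraction of all steps, rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# PLACEHOLDERS: num_examples and grad_accum are not from the notes.
summary = schedule_summary(
    num_examples=40_000, epochs=5, per_device_batch=1, num_procs=2,
    grad_accum=16, warmup_ratio=0.005,
)
print(summary)
```

Plugging in the real line count of instructions.jsonl and the actual accumulation setting shows how many sequences each logged step averages over, which is worth knowing when reading a train/loss chart logged at every step.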

Sharing the train/loss chart: https://api.wandb.ai/links/augmxnt/eznbmx2x

TODO

Improve smoothness of train/loss:

On full 4096 context

  • 7B uses 10GB
  • 13B uses 15.6GB
  • 34B uses a little more than 24GB
  • 70B uses 65GB
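
For a sense of how much of that is the model itself, here's a back-of-the-envelope sketch comparing the listed figures with the size of 4-bit weights alone. It assumes the numbers come from 4-bit QLoRA runs, which the list doesn't actually say, and it ignores quantization constants; 34B is left out because the list only gives "a little more than 24". `weight_gb` is a hypothetical helper.

```python
# Observed usage in GB, copied from the list above (34B omitted: no exact figure).
OBSERVED_GB = {7: 10.0, 13: 15.6, 70: 65.0}


def weight_gb(params_billion: float, bits: int = 4) -> float:
    """Approximate size of the quantized weights alone, in GB (10^9 bytes)."""
    # ASSUMPTION: every parameter is stored at `bits` bits, no overhead.
    return params_billion * 1e9 * bits / 8 / 1e9


for size, observed in OBSERVED_GB.items():
    w = weight_gb(size)
    # Whatever is left over covers activations, adapters, optimizer state, etc.
    print(f"{size}B: weights ~{w:.1f} GB, observed {observed} GB, "
          f"remainder ~{observed - w:.1f} GB")
```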

Packages

Examples


Revision #8
Created 28 September 2023 17:14:18 by lhl
Updated 29 September 2023 10:43:13 by lhl