Our Notebook: https://github.com/AUGMXNT/llm-judge/blob/main/analyze.ipynb
See also our HTML export of the notebook: https://github.com/AUGMXNT/llm-judge/blob/main/analyze.html
Reference notebooks and scripts:
- https://github.com/dmitrymailk/ru_lm/blob/73dbb763d2618af586c5798c777dd85dc5edf40f/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/FastChat/fastchat/llm_judge/noteboooks/mt_bench_radar.ipynb
- https://github.com/PageIV/llama-rabbit/blob/fe30651134d180a5ab90f3ba168f8afde019ca7b/eval/mt_bench/example.ipynb#L4
- https://github.com/truesoni/llama_index/blob/a31b12796248f6d1fc1eb9b71c964d9b2567716a/docs/examples/evaluation/mt_bench_human_judgement.ipynb#L66
- https://github.com/sagerpascal/deepspeed-llm/blob/0d1243fc4b35f5d7dff22aa9f92956f2e16116ec/eval/mt_bench_radar.ipynb
- https://github.com/run-llama/llama_index/blob/6c6f586322b088bcae9005e0a704e9bc4d205055/docs/examples/evaluation/mt_bench_human_judgement.ipynb#L8
- https://github.com/LLM-Tuning-Safety/LLMs-Finetuning-Safety/blob/87094e4192a7ee5b25713a14b590e5a4c7949d61/gpt-3.5/mt_bench_evaluation.ipynb#L19
- https://github.com/run-llama/llama_index/blob/6c6f586322b088bcae9005e0a704e9bc4d205055/docs/examples/evaluation/mt_bench_single_grading.ipynb#L8
- https://github.com/nabenabe0928/meta-learn-tpe/blob/72874dcd1c3cd9aab46bb4bca59b46d379497b07/viz/viz_dataset_dist.py#L7
- https://github.com/dnbaker/dashing2-experiments/blob/8fb95662863464ec72a9b9951ec64d8943132a4e/allpairs/exhaustive/plot_mt_benchmark.py#L4