List of Evals
- MosaicML Model Gauntlet - 34 benchmarks in 6 categories
- HuggingFace Open LLM Leaderboard
  - Warning: their MMLU results are wrong, which throws off the whole ranking: https://twitter.com/Francis_YAO_/status/1666833311279517696
- LMSys Chatbot Arena Leaderboard - Elo-style ranking
- LLM-Leaderboard
- Gotzmann LLM Score v2 (discussion)
- Chain-of-Thought Hub
- C-Eval Leaderboard
- AlpacaEval Leaderboard
- YearZero's LLM Logic Tests
- HELM Core Scenarios
- TextSynth Server
- llm-jeopardy - automated quiz show answering
- Troyanovsky/Local-LLM-comparison - one person's testing of different community models (fine-tuned quants) against his own standardized eval
- LLM Logic Tests
- Asking 60+ LLMs a set of 20 questions
- paperswithcode (based on numbers published in papers, not independently verified or standardized)
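
For readers unfamiliar with the "Elo-style ranking" mentioned for the Chatbot Arena entry, here is a minimal sketch of a textbook Elo update from pairwise comparisons. It is not necessarily how that leaderboard computes its scores: the K-factor of 32, the 400-point scale, the starting rating of 1000, and the model names are all assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k (assumed 32 here) controls how far a single result moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical models and battle outcomes, for illustration only.
ratings = {"model-a": 1000.0, "model-b": 1000.0}
battles = [("model-a", "model-b", 1.0), ("model-a", "model-b", 0.5)]

for a, b, score_a in battles:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)

# Print models from highest to lowest rating.
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Because each update adds to one rating exactly what it subtracts from the other, the total rating mass stays constant. The resulting order depends on the sequence of battles, which is one reason rankings of this kind are usually reported with many comparisons per model.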
See also:
Code
- Big Code Models Leaderboard - HF style leaderboard
- Zero-Shot Replication Framework - replicate HumanEval, LeetCodeSparks, LeetCode100
- code-eval - scripts for running/reproducing human-eval scores on models
- llm-humaneval-benchmarks - HuggingFace models evaluated against HumanEval+
- Multilingual Code Models Evaluation - base multilingual code generation models
- CanAiCode Leaderboard - using Can AI Code? eval
- airate - C++ bug catching test
- phi-1 prompt tests
Roleplay
New
- EvalPlus 0.2 - https://github.com/evalplus/evalplus/releases/tag/v0.2.0
- MMMU
  - https://huggingface.co/papers/2311.16502
  - https://arxiv.org/pdf/2311.16502.pdf
  - https://github.com/MMMU-Benchmark/MMMU
  - https://mmmu-benchmark.github.io/
  - https://twitter.com/xiangyue96/status/1729698316554801358
- GAIA: a benchmark for General AI Assistants
  - https://arxiv.org/abs/2311.12983
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark
  - https://arxiv.org/abs/2311.12022
- AgentBench
  - https://github.com/THUDM/AgentBench
  - https://github.com/THUDM/AgentBench#leaderboard
- Skill-Mix
  - https://arxiv.org/abs/2310.17567
- Flash HELM
  - https://arxiv.org/abs/2308.11696
- FLASK
  - https://twitter.com/SeonghyeonYe/status/1682209670302408705
  - https://arxiv.org/abs/2307.10928
  - https://kaistai.github.io/FLASK/