2024-01: Towards the end of 2023, the HuggingFace Open LLM Leaderboard drove a trend of benchmark gaming that makes it (and other synthetic benchmarks) largely useless for testing model capabilities.
The current consensus is that LMSys Chatbot Arena, which lets users compare responses side by side and pick a winner (with Elo-style ranking; a minimal sketch of the update follows the list below), is the gold standard for ranking general model performance. Obviously it has its own weaknesses (biased by audience and use case, colored by refusals and other considerations), but it seems to be the best we've got at the moment.
- LMSys is also collecting datasets from this chat data which could be extremely useful for training
- u/DontPlanToEnd posted a benchmark correlation analysis on 2023-12-30 which showed MT-Bench (a GPT-4-judged benchmark) had a 0.89 correlation with Chatbot Arena, making it probably the second-best proxy score (MMLU has a 0.85 correlation)
- u/WolframRavenwolf has been posting his own LLM Comparison/Tests of new models, which are pretty interesting - they test in German (almost assuredly out of distribution) and focus on instruction following, but are a good sanity check
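Here is a minimal sketch of the Elo-style pairwise update used by arena-style leaderboards; the model names, K-factor, and votes are placeholders, and Chatbot Arena's published rankings actually use a more involved Bradley-Terry fit with confidence intervals:

```python
# Elo-style rating update from head-to-head votes (hypothetical models/votes).

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, winner: str, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one vote: 'a', 'b', or 'tie'."""
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * (e_a - score_a)

ratings = {"model-x": 1000.0, "model-y": 1000.0}
for vote in ["a", "a", "tie", "b"]:  # hypothetical user votes
    ratings["model-x"], ratings["model-y"] = update(ratings["model-x"], ratings["model-y"], vote)
print(ratings)
```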
General
- Livebench - refreshed every month (6 month question rotation)
- MixEval
- IFEval
- RULER
- Berkeley Function-Calling Leaderboard
- ZeroEval Leaderboard - project source; description: https://threadreaderapp.com/thread/1814037110577578377.html
  - MMLU-redux (-d mmlu-redux) - knowledge
  - GSM8K (-d gsm) - math
  - ZebraLogic (-d zebra-grid) - logic
  - CRUX (-d crux) - code
- MosaicML Model Gauntlet - 34 benchmarks in 6 categories
- HuggingFace Open LLM Leaderboard
- warning, their MMLU results are wrong, throwing off the whole ranking: https://twitter.com/Francis_YAO_/status/1666833311279517696
- LMSys Chatbot Arena Leaderboard - Elo-style ranking
- LLM-Leaderboard
- Gotzmann LLM Score v2 (discussion)
- Chain-of-Thought Hub
- C-Eval Leaderboard
- AlpacaEval Leaderboard
- YearZero’s LLM Logic Tests
- HELM Core Scenarios
- TextSynth Server
- llm-jeopardy - automated quiz show answering
- Troyanovsky/Local-LLM-comparison - one person’s testing on his own standardized eval against different community models (fine-tuned quants)
- LLM Logic Tests
- Asking 60+ LLMs a set of 20 questions
- paperswithcode (based on numbers published in papers, not independently verified or standardized)
- YALL
Price Perf
Running Your Own
Do your own Chat Arena! https://github.com/Contextualist/lone-arena
promptfoo: https://www.promptfoo.dev/docs/intro/
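As a rough sketch of rolling your own head-to-head test, the snippet below queries two local models behind OpenAI-compatible servers (e.g. llama.cpp or vLLM) and prints their answers for manual judging; the ports, model names, and prompts are assumptions, and lone-arena/promptfoo handle prompt sets, blinding, and scoring properly:

```python
# Minimal lone-arena-style A/B comparison against two assumed local endpoints.
import requests

ENDPOINTS = {
    "model-a": "http://localhost:8001/v1/chat/completions",
    "model-b": "http://localhost:8002/v1/chat/completions",
}
PROMPTS = ["Explain RAG in two sentences.", "Write a haiku about GPUs."]

def ask(url: str, prompt: str) -> str:
    resp = requests.post(url, json={
        "model": "local",  # most local servers ignore or override this field
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

for prompt in PROMPTS:
    print(f"\n=== {prompt}")
    for name, url in ENDPOINTS.items():
        print(f"\n--- {name}\n{ask(url, prompt)}")
    # Pick a winner yourself (ideally blinded) and tally results,
    # e.g. with the Elo update sketched earlier on this page.
```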
See also:
Hallucination
Vectara Hallucination Leaderboard
Code
See Code Evaluation for code evals.
Roleplay
Context
InfiniteBench
- https://github.com/OpenBMB/InfiniteBench
- https://huggingface.co/datasets/xinrongzhang2022/InfiniteBench
- https://www.reddit.com/r/LocalLLaMA/comments/18ct9xh/infinitebench_100k_longcontext_benchmark/
Japanese
New
- https://www.reddit.com/r/LocalLLaMA/comments/1945tfv/challenge_llms_to_reason_about_reasoning_a/
- DiagGSM8K
- EvalPlus 0.2 - https://github.com/evalplus/evalplus/releases/tag/v0.2.0
- MMMU
- GAIA: a benchmark for General AI Assistants
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark
- AgentBench
- Skill-Mix
- Flash HELM
- ARB: Advanced Reasoning Benchmark for Large Language Models
- FLASK
Contamination
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples
- https://huggingface.co/blog/rishiraj/merge-models-without-contamination
- https://opencompass.readthedocs.io/en/latest/advanced_guides/contamination_eval.html