2024-01: Towards the end of 2023, the HuggingFace Open LLM Leaderboard drove a trend of benchmark gaming that makes it (and other synthetic benchmarks) largely useless for testing model capabilities.
The current consensus is that LMSys Chatbot Arena, which lets users compare responses side by side and pick a winner (with Elo-style ranking; a minimal sketch of the update follows the list below), is the gold standard for ranking general model performance. Obviously it has its own weaknesses (biased by audience and use case, colored by refusals and other considerations), but it seems to be the best we've got at the moment.
- LMSys is also collecting datasets from this chat data which could be extremely useful for training
- u/DontPlanToEnd posted a benchmark correlation analysis on 2023-12-30 which showed MT-Bench (a GPT-4-judged benchmark) had a 0.89 correlation with Chatbot Arena, making it probably the second-best proxy score (MMLU has a 0.85 correlation)
- u/WolframRavenwolf has been posting his own LLM Comparison/Tests of new models, which are pretty interesting - they test in German (almost assuredly out of distribution) and focus on instruction following, but are a good sanity check
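Here is a minimal sketch of the Elo-style pairwise update used by arena-style leaderboards; the model names, K-factor, and votes are placeholders, and Chatbot Arena's published rankings actually use a more involved Bradley-Terry fit with confidence intervals:

```python
# Elo-style rating update from head-to-head votes (hypothetical models/votes).

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, winner: str, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one vote: 'a', 'b', or 'tie'."""
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * (e_a - score_a)

ratings = {"model-x": 1000.0, "model-y": 1000.0}
for vote in ["a", "a", "tie", "b"]:  # hypothetical user votes
    ratings["model-x"], ratings["model-y"] = update(ratings["model-x"], ratings["model-y"], vote)
print(ratings)
```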
General
- Livebench - refreshed every month (6 month question rotation)
- MixEval
- IFEval
- RULER
- Berkeley Function-Calling Leaderboard
- ZeroEval Leaderboard - project source; description: https://threadreaderapp.com/thread/1814037110577578377.html
  - MMLU-redux (-d mmlu-redux) - knowledge
  - GSM8K (-d gsm) - math
  - ZebraLogic (-d zebra-grid) - logic
  - CRUX (-d crux) - code
- MosaicML Model Gauntlet - 34 benchmarks in 6 categories
- HuggingFace Open LLM Leaderboard
- warning, their MMLU results are wrong, throwing off the whole ranking: https://twitter.com/Francis_YAO_/status/1666833311279517696
- LMSys Chatbot Arena Leaderboard - Elo-style ranking
- LLM-Leaderboard
- Gotzmann LLM Score v2 (discussion)
- Chain-of-Thought Hub
- C-Eval Leaderboard
- AlpacaEval Leaderboard
- YearZero’s LLM Logic Tests
- HELM Core Scenarios
- TextSynth Server
- llm-jeopardy - automated quiz show answering
- Troyanovsky/Local-LLM-comparison - one person’s testing on his own standardized eval against different community models (fine-tuned quants)
- LLM Logic Tests
- Asking 60+ LLMs a set of 20 questions
- paperswithcode (based on numbers published in papers, not independently verified or standardized)
- YALL
Price Perf
Running Your Own
Do your own Chat Arena! https://github.com/Contextualist/lone-arena
promptfoo: https://www.promptfoo.dev/docs/intro/
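As a rough sketch of rolling your own head-to-head test, the snippet below queries two local models behind OpenAI-compatible servers (e.g. llama.cpp or vLLM) and prints their answers for manual judging; the ports, model names, and prompts are assumptions, and lone-arena/promptfoo handle prompt sets, blinding, and scoring properly:

```python
# Minimal lone-arena-style A/B comparison against two assumed local endpoints.
import requests

ENDPOINTS = {
    "model-a": "http://localhost:8001/v1/chat/completions",
    "model-b": "http://localhost:8002/v1/chat/completions",
}
PROMPTS = ["Explain RAG in two sentences.", "Write a haiku about GPUs."]

def ask(url: str, prompt: str) -> str:
    resp = requests.post(url, json={
        "model": "local",  # most local servers ignore or override this field
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

for prompt in PROMPTS:
    print(f"\n=== {prompt}")
    for name, url in ENDPOINTS.items():
        print(f"\n--- {name}\n{ask(url, prompt)}")
    # Pick a winner yourself (ideally blinded) and tally results,
    # e.g. with the Elo update sketched earlier on this page.
```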
See also:
Hallucination
Vectara Hallucination Leaderboard
Code
See Code Evaluation for code evals.
Roleplay
Context
InfiniteBench
- https://github.com/OpenBMB/InfiniteBench
- https://huggingface.co/datasets/xinrongzhang2022/InfiniteBench
- https://www.reddit.com/r/LocalLLaMA/comments/18ct9xh/infinitebench_100k_longcontext_benchmark/
Japanese
New
- https://www.reddit.com/r/LocalLLaMA/comments/1945tfv/challenge_llms_to_reason_about_reasoning_a/
- DiagGSM8K
- EvalPlus 0.2 - https://github.com/evalplus/evalplus/releases/tag/v0.2.0
- MMMU
- GAIA: a benchmark for General AI Assistants
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark
- AgentBench
- Skill-Mix
- Flash HELM
- ARB: Advanced Reasoning Benchmark for Large Language Models
- FLASK
Contamination
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples
- https://huggingface.co/blog/rishiraj/merge-models-without-contamination
- https://opencompass.readthedocs.io/en/latest/advanced_guides/contamination_eval.html